Full text of "Introductory Statistics 7th Ed P. Mann ( Wiley, 2010) WW"

KEY FORMULAS

Prem S. Mann • Introductory Statistics, Seventh Edition

Chapter 2 • Organizing and Graphing Data

• Relative frequency of a class = $f / \sum f$

• Percentage of a class = (Relative frequency) × 100

• Class midpoint or mark = (Upper limit + Lower limit)/2

• Class width = Upper boundary - Lower boundary

• Cumulative relative frequency = (Cumulative frequency) / (Total observations in the data set)

• Cumulative percentage = (Cumulative relative frequency) × 100
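The Chapter 2 formulas can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the class frequencies are made up for the example and do not come from the text.

```python
from itertools import accumulate

def frequency_table(freqs):
    """Return relative, cumulative, and cumulative relative frequencies."""
    total = sum(freqs)                       # sum of f over all classes
    relative = [f / total for f in freqs]    # f / sum(f)
    cumulative = list(accumulate(freqs))     # running total of f
    cumulative_relative = [c / total for c in cumulative]
    return relative, cumulative, cumulative_relative

# Hypothetical frequencies for three classes
relative, cumulative, cumulative_relative = frequency_table([4, 10, 6])
percentages = [r * 100 for r in relative]
```

Multiplying any relative or cumulative relative frequency by 100 gives the corresponding percentage.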

Chapter 3 • Numerical Descriptive Measures

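• Mean for ungrouped data: $\mu = \sum x / N$ and $\bar{x} = \sum x / n$

• Mean for grouped data: $\mu = \sum mf / N$ and $\bar{x} = \sum mf / n$
where $m$ is the midpoint and $f$ is the frequency of a class

• Median for ungrouped data
= Value of the middle term in a ranked data set

• Range = Largest value - Smallest value

• Variance for ungrouped data:

$\sigma^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$ and $s^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$

where $\sigma^2$ is the population variance and $s^2$ is the sample
variance

• Standard deviation for ungrouped data:

$\sigma = \sqrt{\dfrac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}}$ and $s = \sqrt{\dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$

where $\sigma$ and $s$ are the population and sample standard
deviations, respectively

• Variance for grouped data:

$\sigma^2 = \dfrac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}$ and $s^2 = \dfrac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1}$

• Standard deviation for grouped data:

$\sigma = \sqrt{\dfrac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}}$ and $s = \sqrt{\dfrac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1}}$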
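As a rough check on the short-cut variance formulas, the sketch below computes the sample variance for ungrouped and grouped data and compares the ungrouped result with Python's `statistics.variance`. The function names and the data are illustrative assumptions, not taken from the text.

```python
import math
import statistics

def sample_variance(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1) for ungrouped data."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

def grouped_sample_variance(midpoints, freqs):
    """s^2 = (sum(m^2 f) - (sum(m f))^2 / n) / (n - 1) for grouped data."""
    n = sum(freqs)
    sum_mf = sum(m * f for m, f in zip(midpoints, freqs))
    sum_m2f = sum(m * m * f for m, f in zip(midpoints, freqs))
    return (sum_m2f - sum_mf ** 2 / n) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical sample
s2 = sample_variance(data)
s = math.sqrt(s2)                           # sample standard deviation
g2 = grouped_sample_variance([5, 15, 25], [2, 5, 3])
```

The population versions differ only in dividing by $N$ instead of $n - 1$.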

• Chebyshev's theorem:

For any number $k$ greater than 1, at least $(1 - 1/k^2)$ of the
values for any distribution lie within $k$ standard deviations
of the mean.

• Empirical rule:

For a specific bell-shaped distribution, about 68% of the
observations fall in the interval $(\mu - \sigma)$ to $(\mu + \sigma)$, about
95% fall in the interval $(\mu - 2\sigma)$ to $(\mu + 2\sigma)$, and about
99.7% fall in the interval $(\mu - 3\sigma)$ to $(\mu + 3\sigma)$.
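To see how the two rules relate, the following sketch compares the Chebyshev lower bound with the exact area under a normal curve within $k$ standard deviations of the mean. It assumes a normal distribution as the bell-shaped case, and the function names are illustrative.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, valid for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

def normal_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    z = NormalDist()                 # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 4), round(normal_within(k), 4))
```

Chebyshev's bound holds for any distribution, so it is much more conservative than the empirical rule's percentages.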

• $Q_1$ = First quartile given by the value of the middle term
among the (ranked) observations that are less than the
median

• $Q_2$ = Second quartile given by the value of the middle term
in a ranked data set

• $Q_3$ = Third quartile given by the value of the middle term
among the (ranked) observations that are greater than
the median

• Interquartile range: IQR = $Q_3 - Q_1$

• The $k$th percentile:

$P_k$ = Value of the $\left(\dfrac{kn}{100}\right)$th term in a ranked data set

• Percentile rank of $x_i$

= $\dfrac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$
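A minimal sketch of these definitions follows. It assumes that "less than the median" and "greater than the median" are decided by position in the ranked data, so tied values are split by rank. The function names and data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the middle-term definitions above."""
    xs = sorted(values)
    n = len(xs)
    lower = xs[: n // 2]           # terms ranked below the median position
    upper = xs[(n + 1) // 2:]      # terms ranked above the median position
    return median(lower), median(xs), median(upper)

def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical ranked data
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
rank_of_7 = percentile_rank(data, 7)
```

Other quartile conventions exist (for example, interpolation-based ones), so results from software may differ slightly from this definition.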

Chapter 4 • Probability

• Classical probability rule for a simple event:

$P(E_i) = \dfrac{1}{\text{Total number of outcomes}}$

• Classical probability rule for a compound event:

$P(A) = \dfrac{\text{Number of outcomes in } A}{\text{Total number of outcomes}}$

• Relative frequency as an approximation of probability:

$P(A) = \dfrac{f}{n}$

• Conditional probability of an event:

$P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$ and $P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}$

• Condition for independence of events:

$P(A) = P(A \mid B)$ and/or $P(B) = P(B \mid A)$

• For complementary events: $P(A) + P(\bar{A}) = 1$

• Multiplication rule for dependent events:

$P(A \text{ and } B) = P(A)\,P(B \mid A)$

• Multiplication rule for independent events:

$P(A \text{ and } B) = P(A)\,P(B)$

• Joint probability of two mutually exclusive events:

$P(A \text{ and } B) = 0$

• Addition rule for mutually nonexclusive events:

$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

• Addition rule for mutually exclusive events:

$P(A \text{ or } B) = P(A) + P(B)$
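The rules above can be checked by brute-force enumeration. The sketch below uses one roll of a fair six-sided die as an assumed experiment; the events and the helper `prob` are chosen only for illustration.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)                      # one roll of a fair die

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = sum(1 for o in OUTCOMES if event(o))
    return Fraction(favourable, len(OUTCOMES))

def is_even(o):                             # event A
    return o % 2 == 0

def above_three(o):                         # event B
    return o > 3

p_a = prob(is_even)
p_b = prob(above_three)
p_a_and_b = prob(lambda o: is_even(o) and above_three(o))
p_a_or_b = prob(lambda o: is_even(o) or above_three(o))
p_a_given_b = p_a_and_b / p_b               # conditional probability
independent = (p_a == p_a_given_b)          # independence condition
```

Here $P(A \mid B) \ne P(A)$, so these two events are dependent and the multiplication rule for dependent events applies.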

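• Population proportion: $p = X/N$

• Sample proportion: $\hat{p} = x/n$

• Mean of $\hat{p}$: $\mu_{\hat{p}} = p$

• Standard deviation of $\hat{p}$ when $n/N \le .05$: $\sigma_{\hat{p}} = \sqrt{pq/n}$

• $z$ value for $\hat{p}$: $z = \dfrac{\hat{p} - p}{\sigma_{\hat{p}}}$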

Chapter 5 • Discrete Random Variables and Their
Probability Distributions

• Mean of a discrete random variable $x$: $\mu = \sum x P(x)$

• Standard deviation of a discrete random variable $x$:

$\sigma = \sqrt{\sum x^2 P(x) - \mu^2}$

• $n$ factorial: $n! = n(n - 1)(n - 2) \cdots 3 \cdot 2 \cdot 1$

• Number of combinations of $n$ items selected $x$ at a time:

${}_nC_x = \dfrac{n!}{x!\,(n - x)!}$

• Number of permutations of $n$ items selected $x$ at a time:

${}_nP_x = \dfrac{n!}{(n - x)!}$

• Binomial probability formula: $P(x) = {}_nC_x\, p^x q^{n - x}$

• Mean and standard deviation of the binomial distribution:

$\mu = np$ and $\sigma = \sqrt{npq}$

• Hypergeometric probability formula:

$P(x) = \dfrac{{}_rC_x \; {}_{N-r}C_{n-x}}{{}_NC_n}$

• Poisson probability formula: $P(x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$

• Mean, variance, and standard deviation of the Poisson
probability distribution:

$\mu = \lambda$, $\sigma^2 = \lambda$, and $\sigma = \sqrt{\lambda}$
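The three probability formulas translate directly into code. This sketch uses only Python's `math` module; the function names and example numbers are illustrative assumptions, with `r` standing for the number of successes in the population as in the hypergeometric formula above.

```python
from math import comb, exp, factorial, sqrt

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def hypergeometric_pmf(x, N, r, n):
    """P(x) = rCx * (N-r)C(n-x) / NCn."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

def poisson_pmf(x, lam):
    """P(x) = lam^x * e^(-lam) / x!."""
    return lam ** x * exp(-lam) / factorial(x)

# Hypothetical examples
p_two_heads = binomial_pmf(2, 3, 0.5)       # 2 successes in 3 trials
binomial_mean = 3 * 0.5                     # mu = np
binomial_sd = sqrt(3 * 0.5 * 0.5)           # sigma = sqrt(npq)
p_hyper = hypergeometric_pmf(1, N=10, r=4, n=3)
p_zero_events = poisson_pmf(0, 2.0)
```

Summing `binomial_pmf(x, n, p)` over `x = 0, ..., n` should give 1, which is a quick sanity check on any implementation.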

Chapter 6 • Continuous Random Variables
and the Normal Distribution

• $z$ value for an $x$ value: $z = \dfrac{x - \mu}{\sigma}$

• Value of $x$ when $\mu$, $\sigma$, and $z$ are known: $x = \mu + z\sigma$
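The sketch below shows the two conversions and uses Python's `statistics.NormalDist` to compute the cumulative area to the left of $z$, which is the quantity that a standard normal table lists. The mean, standard deviation, and $x$ value are made-up numbers.

```python
from statistics import NormalDist

def z_value(x, mu, sigma):
    """z = (x - mu) / sigma."""
    return (x - mu) / sigma

def x_value(z, mu, sigma):
    """x = mu + z * sigma."""
    return mu + z * sigma

standard_normal = NormalDist()              # mean 0, standard deviation 1

z = z_value(x=55, mu=50, sigma=10)          # hypothetical values
area_left = standard_normal.cdf(z)          # area under the curve left of z
area_left_of_minus_196 = standard_normal.cdf(-1.96)
x_back = x_value(z, mu=50, sigma=10)
```

Rounded to four decimals, `area_left_of_minus_196` is .0250, which agrees with the standard normal table entry for $z = -1.96$.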

Chapter 7 • Sampling Distributions

• Mean of $\bar{x}$: $\mu_{\bar{x}} = \mu$

• Standard deviation of $\bar{x}$ when $n/N \le .05$: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$

• $z$ value for $\bar{x}$: $z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}}$

Chapter 8 • Estimation of the Mean and Proportion

• Point estimate of $\mu$ = $\bar{x}$

• Confidence interval for $\mu$ using the normal distribution
when $\sigma$ is known:

$\bar{x} \pm z\sigma_{\bar{x}}$ where $\sigma_{\bar{x}} = \sigma/\sqrt{n}$

• Confidence interval for $\mu$ using the $t$ distribution when $\sigma$ is
not known:

$\bar{x} \pm t s_{\bar{x}}$ where $s_{\bar{x}} = s/\sqrt{n}$

• Margin of error of the estimate for $\mu$:

$E = z\sigma_{\bar{x}}$ or $t s_{\bar{x}}$

• Determining sample size for estimating $\mu$:

$n = z^2\sigma^2/E^2$

• Confidence interval for $p$ for a large sample:

$\hat{p} \pm z s_{\hat{p}}$ where $s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}$

• Margin of error of the estimate for $p$:

$E = z s_{\hat{p}}$ where $s_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}$

• Determining sample size for estimating $p$:

$n = z^2 pq/E^2$
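A possible implementation of the known-$\sigma$ interval and the sample-size formula is sketched below. The critical value comes from `NormalDist().inv_cdf` rather than a table, the sample size is rounded up (a common convention, assumed here), and all names and numbers are illustrative.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z with area (1 - confidence) / 2 in each tail of the normal curve."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def mean_interval_known_sigma(xbar, sigma, n, confidence=0.95):
    """xbar +/- z * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """n = z^2 * sigma^2 / E^2, rounded up to a whole observation."""
    z = z_critical(confidence)
    return math.ceil(z ** 2 * sigma ** 2 / margin ** 2)

low, high = mean_interval_known_sigma(xbar=50, sigma=10, n=100)
needed = sample_size_for_mean(sigma=10, margin=2)
```

When $\sigma$ is unknown, the same structure applies with $s$ in place of $\sigma$ and a $t$ critical value in place of $z$; the standard library has no $t$ distribution, so that value would come from a table or another package.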

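Chapter 9 • Hypothesis Tests about the Mean
and Proportion

• Test statistic $z$ for a test of hypothesis about $\mu$ using the
normal distribution when $\sigma$ is known:

$z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}}$ where $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

• Test statistic for a test of hypothesis about $\mu$ using the $t$
distribution when $\sigma$ is not known:

$t = \dfrac{\bar{x} - \mu}{s_{\bar{x}}}$ where $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$

• Test statistic for a test of hypothesis about $p$ for a large
sample:

$z = \dfrac{\hat{p} - p}{\sigma_{\hat{p}}}$ where $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$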
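The $z$ test statistics can be computed as in this sketch, which also derives a two-tailed $p$-value from the standard normal curve. The function names, hypothesized values, and sample figures are assumptions made for the example.

```python
import math
from statistics import NormalDist

def z_for_mean(xbar, mu0, sigma, n):
    """z = (xbar - mu) / (sigma / sqrt(n)), with mu taken from H0."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

def z_for_proportion(p_hat, p0, n):
    """z = (p_hat - p) / sqrt(p q / n), with p taken from H0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def two_tailed_p_value(z):
    """Area in both tails beyond |z| under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z_mean = z_for_mean(xbar=52, mu0=50, sigma=10, n=100)
z_prop = z_for_proportion(p_hat=0.6, p0=0.5, n=100)
p_value = two_tailed_p_value(z_mean)
```

With a significance level of .05, a two-tailed $p$-value of about .0455 would lead to rejecting the null hypothesis in this made-up case.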

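Chapter 10 • Estimation and Hypothesis Testing:
Two Populations

• Mean of the sampling distribution of $\bar{x}_1 - \bar{x}_2$:

$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$

• Confidence interval for $\mu_1 - \mu_2$ for two independent
samples using the normal distribution when $\sigma_1$ and $\sigma_2$ are
known:

$(\bar{x}_1 - \bar{x}_2) \pm z\sigma_{\bar{x}_1 - \bar{x}_2}$ where $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

• Test statistic for a test of hypothesis about $\mu_1 - \mu_2$ for two
independent samples using the normal distribution when $\sigma_1$
and $\sigma_2$ are known:

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$

• For two independent samples taken from two populations
with equal but unknown standard deviations:

Pooled standard deviation:

$s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

Estimate of the standard deviation of $\bar{x}_1 - \bar{x}_2$:

$s_{\bar{x}_1 - \bar{x}_2} = s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

Confidence interval for $\mu_1 - \mu_2$ using the $t$ distribution:

$(\bar{x}_1 - \bar{x}_2) \pm t s_{\bar{x}_1 - \bar{x}_2}$

Test statistic using the $t$ distribution:

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$

• For two independent samples selected from two populations
with unequal and unknown standard deviations:

Degrees of freedom:

$df = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$

Estimate of the standard deviation of $\bar{x}_1 - \bar{x}_2$:

$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

Confidence interval for $\mu_1 - \mu_2$ using the $t$ distribution:

$(\bar{x}_1 - \bar{x}_2) \pm t s_{\bar{x}_1 - \bar{x}_2}$

Test statistic using the $t$ distribution:

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$

• For two paired or matched samples:

Sample mean for paired differences: $\bar{d} = \sum d / n$

Sample standard deviation for paired differences:

$s_d = \sqrt{\dfrac{\sum d^2 - \frac{(\sum d)^2}{n}}{n - 1}}$

Mean and standard deviation of the sampling distribution
of $\bar{d}$:

$\mu_{\bar{d}} = \mu_d$ and $s_{\bar{d}} = s_d/\sqrt{n}$

Confidence interval for $\mu_d$ using the $t$ distribution:

$\bar{d} \pm t s_{\bar{d}}$ where $s_{\bar{d}} = s_d/\sqrt{n}$

Test statistic for a test of hypothesis about $\mu_d$ using the $t$
distribution:

$t = \dfrac{\bar{d} - \mu_d}{s_{\bar{d}}}$

• For two large and independent samples, confidence interval
for $p_1 - p_2$:

$(\hat{p}_1 - \hat{p}_2) \pm z s_{\hat{p}_1 - \hat{p}_2}$ where $s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$

• For two large and independent samples, for a test of
hypothesis about $p_1 - p_2$ with $H_0$: $p_1 - p_2 = 0$:

Pooled sample proportion:

$\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$ or $\dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$

Estimate of the standard deviation of $\hat{p}_1 - \hat{p}_2$:

$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\bar{p}\,\bar{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

Test statistic: $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{s_{\hat{p}_1 - \hat{p}_2}}$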
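The two-sample standard errors and the unequal-variance degrees of freedom are easy to mistype by hand, so a small sketch may help. The function names and summary statistics are hypothetical, and `df_unequal` returns the unrounded value; how to round it is left to the reader's convention.

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """s_p = sqrt(((n1-1) s1^2 + (n2-1) s2^2) / (n1 + n2 - 2))."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def se_equal_sd(n1, s1, n2, s2):
    """Standard error of xbar1 - xbar2 assuming equal standard deviations."""
    return pooled_sd(n1, s1, n2, s2) * math.sqrt(1 / n1 + 1 / n2)

def se_unequal_sd(n1, s1, n2, s2):
    """Standard error of xbar1 - xbar2 without the equal-sd assumption."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def df_unequal(n1, s1, n2, s2):
    """Degrees of freedom for the unequal-sd case (not rounded)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

def t_statistic(xbar1, xbar2, se, hypothesized_diff=0.0):
    """t = ((xbar1 - xbar2) - (mu1 - mu2)) / standard error."""
    return ((xbar1 - xbar2) - hypothesized_diff) / se

se = se_equal_sd(10, 2.0, 10, 2.0)          # hypothetical summary statistics
t = t_statistic(12.0, 10.0, se)
```

With equal sample sizes and equal sample standard deviations, the two standard errors coincide and `df_unequal` reduces to $n_1 + n_2 - 2$.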

Chapter 11 • Chi-Square Tests

• Expected frequency for a category for a goodness-of-fit
test:

$E = np$

• Degrees of freedom for a goodness-of-fit test:

$df = k - 1$ where $k$ is the number of categories

• Expected frequency for a cell for an independence or
homogeneity test:

$E = \dfrac{(\text{Row total})(\text{Column total})}{\text{Sample size}}$

• Degrees of freedom for a test of independence or
homogeneity:

$df = (R - 1)(C - 1)$
where $R$ and $C$ are the total number of rows and columns,
respectively, in the contingency table

• Test statistic for a goodness-of-fit test and a test of
independence or homogeneity:

$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

• Confidence interval for the population variance $\sigma^2$:

$\dfrac{(n - 1)s^2}{\chi^2_{\alpha/2}}$ to $\dfrac{(n - 1)s^2}{\chi^2_{1 - \alpha/2}}$

• Test statistic for a test of hypothesis about $\sigma^2$:

$\chi^2 = \dfrac{(n - 1)s^2}{\sigma^2}$
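For a test of independence, the expected frequencies and the test statistic can be computed from a contingency table as sketched below. The table values and the function name are invented for the example; comparing the statistic with a critical value still requires a chi-square table or another package.

```python
def chi_square_independence(observed):
    """Return (expected frequencies, chi-square statistic, df)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)                     # sample size
    # E = (row total)(column total) / sample size for each cell
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df

table = [[10, 20], [30, 40]]                # hypothetical 2 x 2 table
expected, chi2, df = chi_square_independence(table)
```

For a goodness-of-fit test the same sum applies, with $E = np$ for each category and $df = k - 1$.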

Chapter 12 • Analysis of Variance

Let:

$k$ = the number of different samples (or treatments)
$n_i$ = the size of sample $i$
$T_i$ = the sum of the values in sample $i$
$n$ = the number of values in all samples = $n_1 + n_2 + n_3 + \cdots$
$\sum x$ = the sum of the values in all samples = $T_1 + T_2 + T_3 + \cdots$
$\sum x^2$ = the sum of the squares of values in all samples

• For the $F$ distribution:

Degrees of freedom for the numerator = $k - 1$
Degrees of freedom for the denominator = $n - k$

• Between-samples sum of squares:

$SSB = \left(\dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dfrac{T_3^2}{n_3} + \cdots\right) - \dfrac{(\sum x)^2}{n}$

• Within-samples sum of squares:

$SSW = \sum x^2 - \left(\dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dfrac{T_3^2}{n_3} + \cdots\right)$

• Total sum of squares:

$SST = SSB + SSW = \sum x^2 - \dfrac{(\sum x)^2}{n}$

• Variance between samples: $MSB = SSB/(k - 1)$

• Variance within samples: $MSW = SSW/(n - k)$

• Test statistic for a one-way ANOVA test:
$F = MSB/MSW$

Chapter 13 • Simple Linear Regression

• Simple linear regression model: $y = A + Bx + \epsilon$

• Estimated simple linear regression model: $\hat{y} = a + bx$

• Sum of squares of $xy$, $xx$, and $yy$:

$SS_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$

$SS_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and $SS_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$

• Least squares estimates of $A$ and $B$:

$b = SS_{xy}/SS_{xx}$ and $a = \bar{y} - b\bar{x}$

• Standard deviation of the sample errors:

$s_e = \sqrt{\dfrac{SS_{yy} - b\,SS_{xy}}{n - 2}}$

• Error sum of squares: $SSE = \sum e^2 = \sum (y - \hat{y})^2$

• Total sum of squares: $SST = \sum y^2 - \dfrac{(\sum y)^2}{n}$

• Regression sum of squares: $SSR = SST - SSE$

• Coefficient of determination: $r^2 = b\,SS_{xy}/SS_{yy}$

• Confidence interval for $B$:

$b \pm t s_b$ where $s_b = s_e/\sqrt{SS_{xx}}$

• Test statistic for a test of hypothesis about $B$: $t = \dfrac{b - B}{s_b}$

• Linear correlation coefficient: $r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$

• Test statistic for a test of hypothesis about $\rho$:

$t = r\sqrt{\dfrac{n - 2}{1 - r^2}}$

• Confidence interval for $\mu_{y|x}$:

$\hat{y} \pm t s_{\hat{y}_m}$ where $s_{\hat{y}_m} = s_e\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{SS_{xx}}}$

• Prediction interval for $y_p$:

$\hat{y} \pm t s_{\hat{y}_p}$ where $s_{\hat{y}_p} = s_e\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{SS_{xx}}}$
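The regression quantities can all be derived from the three sums of squares $SS_{xy}$, $SS_{xx}$, and $SS_{yy}$. The following sketch computes the least squares line, the standard deviation of errors, and the correlation measures; the function name and the five data points are made up for illustration.

```python
import math

def simple_linear_regression(xs, ys):
    """Least squares fit of y-hat = a + b x using the SS formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    b = ss_xy / ss_xx                       # slope estimate
    a = sum_y / n - b * sum_x / n           # intercept: y-bar - b * x-bar
    s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
    r = ss_xy / math.sqrt(ss_xx * ss_yy)    # linear correlation coefficient
    r_squared = b * ss_xy / ss_yy           # coefficient of determination
    return {"a": a, "b": b, "s_e": s_e, "r": r, "r_squared": r_squared}

fit = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
predicted_at_6 = fit["a"] + fit["b"] * 6    # point prediction for x = 6
```

Note that `r_squared` equals the square of `r`, which gives a quick consistency check between the two formulas.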

Chapter 14 • Multiple Regression

Formulas for Chapter 14 along with the chapter are on the
Web site for the text.

Chapter 15 • Nonparametric Methods

Formulas for Chapter 15 along with the chapter are on the
Web site for the text.

Table IV Standard Normal Distribution Table

The entries in this table give the
cumulative area under the standard
normal curve to the left of z with the
values of z equal to 0 or negative.

   z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
-3.4  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0002
-3.3  .0005  .0005  .0005  .0004  .0004  .0004  .0004  .0004  .0004  .0003
-3.2  .0007  .0007  .0006  .0006  .0006  .0006  .0006  .0005  .0005  .0005
-3.1  .0010  .0009  .0009  .0009  .0008  .0008  .0008  .0008  .0007  .0007
-3.0  .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
-2.9  .0019  .0018  .0018  .0017  .0016  .0016  .0015  .0015  .0014  .0014
-2.8  .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
-2.7  .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
-2.6  .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
-2.5  .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
-2.4  .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
-2.3  .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
-2.2  .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
-2.1  .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
-2.0  .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
-1.9  .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
-1.8  .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
-1.7  .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
-1.6  .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
-1.5  .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
-1.4  .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
-1.3  .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
-1.2  .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
-1.1  .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
-1.0  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
-0.9  .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
-0.8  .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
-0.7  .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
-0.6  .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
-0.5  .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
-0.4  .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
-0.3  .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
-0.2  .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
-0.1  .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
 0.0  .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641

Table IV Standard Normal Distribution Table (continued)

The entries in this table give the
cumulative area under the standard
normal curve to the left of z with the
values of z equal to 0 or positive.

   z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
 1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
 1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
 2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
 3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998

This is Table IV of Appendix C.

Table V The t Distribution Table

The entries in this table give the critical values
of t for the specified number of degrees
of freedom and areas in the right tail.

Area in the Right Tail under the t Distribution Curve

 df     .10     .05    .025     .01    .005     .001
  1   3.078   6.314  12.706  31.821  63.657  318.309
  2   1.886   2.920   4.303   6.965   9.925   22.327
  3   1.638   2.353   3.182   4.541   5.841   10.215
  4   1.533   2.132   2.776   3.747   4.604    7.173
  5   1.476   2.015   2.571   3.365   4.032    5.893
  6   1.440   1.943   2.447   3.143   3.707    5.208
  7   1.415   1.895   2.365   2.998   3.499    4.785
  8   1.397   1.860   2.306   2.896   3.355    4.501
  9   1.383   1.833   2.262   2.821   3.250    4.297
 10   1.372   1.812   2.228   2.764   3.169    4.144
 11   1.363   1.796   2.201   2.718   3.106    4.025
 12   1.356   1.782   2.179   2.681   3.055    3.930
 13   1.350   1.771   2.160   2.650   3.012    3.852
 14   1.345   1.761   2.145   2.624   2.977    3.787
 15   1.341   1.753   2.131   2.602   2.947    3.733
 16   1.337   1.746   2.120   2.583   2.921    3.686
 17   1.333   1.740   2.110   2.567   2.898    3.646
 18   1.330   1.734   2.101   2.552   2.878    3.610
 19   1.328   1.729   2.093   2.539   2.861    3.579
 20   1.325   1.725   2.086   2.528   2.845    3.552
 21   1.323   1.721   2.080   2.518   2.831    3.527
 22   1.321   1.717   2.074   2.508   2.819    3.505
 23   1.319   1.714   2.069   2.500   2.807    3.485
 24   1.318   1.711   2.064   2.492   2.797    3.467
 25   1.316   1.708   2.060   2.485   2.787    3.450
 26   1.315   1.706   2.056   2.479   2.779    3.435
 27   1.314   1.703   2.052   2.473   2.771    3.421
 28   1.313   1.701   2.048   2.467   2.763    3.408
 29   1.311   1.699   2.045   2.462   2.756    3.396
 30   1.310   1.697   2.042   2.457   2.750    3.385
 31   1.309   1.696   2.040   2.453   2.744    3.375
 32   1.309   1.694   2.037   2.449   2.738    3.365
 33   1.308   1.692   2.035   2.445   2.733    3.356
 34   1.307   1.691   2.032   2.441   2.728    3.348
 35   1.306   1.690   2.030   2.438   2.724    3.340
 36   1.306   1.688   2.028   2.434   2.719    3.333
 37   1.305   1.687   2.026   2.431   2.715    3.326
 38   1.304   1.686   2.024   2.429   2.712    3.319
 39   1.304   1.685   2.023   2.426   2.708    3.313
 40   1.303   1.684   2.021   2.423   2.704    3.307
 41   1.303   1.683   2.020   2.421   2.701    3.301
 42   1.302   1.682   2.018   2.418   2.698    3.296
 43   1.302   1.681   2.017   2.416   2.695    3.291
 44   1.301   1.680   2.015   2.414   2.692    3.286
 45   1.301   1.679   2.014   2.412   2.690    3.281
 46   1.300   1.679   2.013   2.410   2.687    3.277
 47   1.300   1.678   2.012   2.408   2.685    3.273
 48   1.299   1.677   2.011   2.407   2.682    3.269
 49   1.299   1.677   2.010   2.405   2.680    3.265
 50   1.299   1.676   2.009   2.403   2.678    3.261
 51   1.298   1.675   2.008   2.402   2.676    3.258
 52   1.298   1.675   2.007   2.400   2.674    3.255
 53   1.298   1.674   2.006   2.399   2.672    3.251
 54   1.297   1.674   2.005   2.397   2.670    3.248
 55   1.297   1.673   2.004   2.396   2.668    3.245
 56   1.297   1.673   2.003   2.395   2.667    3.242
 57   1.297   1.672   2.002   2.394   2.665    3.239
 58   1.296   1.672   2.002   2.392   2.663    3.237
 59   1.296   1.671   2.001   2.391   2.662    3.234
 60   1.296   1.671   2.000   2.390   2.660    3.232
 61   1.296   1.670   2.000   2.389   2.659    3.229
 62   1.295   1.670   1.999   2.388   2.657    3.227
 63   1.295   1.669   1.998   2.387   2.656    3.225
 64   1.295   1.669   1.998   2.386   2.655    3.223
 65   1.295   1.669   1.997   2.385   2.654    3.220
 66   1.295   1.668   1.997   2.384   2.652    3.218
 67   1.294   1.668   1.996   2.383   2.651    3.216
 68   1.294   1.668   1.995   2.382   2.650    3.214
 69   1.294   1.667   1.995   2.382   2.649    3.213
 70   1.294   1.667   1.994   2.381   2.648    3.211
 71   1.294   1.667   1.994   2.380   2.647    3.209
 72   1.293   1.666   1.993   2.379   2.646    3.207
 73   1.293   1.666   1.993   2.379   2.645    3.206
 74   1.293   1.666   1.993   2.378   2.644    3.204
 75   1.293   1.665   1.992   2.377   2.643    3.202
  ∞   1.282   1.645   1.960   2.326   2.576    3.090

This is Table V of Appendix C.


Seventh Edition

INTRODUCTORY STATISTICS

PREM S. MANN

EASTERN CONNECTICUT STATE UNIVERSITY

WITH THE HELP OF

CHRISTOPHER JAY LACKE

ROWAN UNIVERSITY

WILEY

JOHN WILEY & SONS, INC.

Vice President & Executive Publisher Laurie Rosatone

Project Editors Jenn Albanese and Ellen Keohane

Production Manager Dorothy Sinclair

Senior Production Editor Valerie A. Vargas

Marketing Manager Sarah Davis

Creative Director Harry Nolan

Designer Director Jeof Vita

Production Management Services Aptara®, Inc.

Senior Illustration Editor Anna Melhorn

Photo Associate Sarah Wilkin

Editorial Assistant Beth Pearson

Media Editors Melissa Edwards and Ari Wolfe

Cover Photo Credit © James Leynse/©Corbis

This book was set in 10/12 Times Roman by Aptara®, Inc. and printed and bound
by Courier-Kendallville. The cover was printed by Courier-Kendallville.

This book is printed on acid-free paper.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections
107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the
Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc.
222 Rosewood Drive, Danvers, MA 01923, website www.copyright.com. Requests to the Publisher for
permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030-5774, (201)748-6011, fax (201)748-6008, website http://www.wiley.com/go/permissions.

Evaluation copies are provided to qualified academics and professionals for review purposes only, for use
in their courses during the next academic year. These copies are licensed and may not be sold or transferred to
a third party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instruc-
tions and a free of charge return shipping label are available at www.wiley.com/go/returnlabel. Outside of the
United States, please contact your local representative.

Library of Congress Cataloging in Publication Data

ISBN-13 978-0-470-44466-5 (cloth)

ISBN-13 978-0-470-55663-4 (Binder Ready Version)

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To my mother
and

to the memory of my father

PREFACE

Introductory Statistics is written for a one- or two-semester first course in applied statistics.
This book is intended for students who do not have a strong background in mathematics. The
only prerequisite for this text is knowledge of elementary algebra.

Today, college students from almost all fields of study are required to take at least one
course in statistics. Consequently, the study of statistical methods has taken on a prominent
role in the education of students from a variety of backgrounds and academic pursuits. From
the first edition, the goal of Introductory Statistics has been to make the subject of statistics
interesting and accessible to a wide and varied audience. Three major elements of this text
support this goal:

1. Realistic content of its examples and exercises, drawing from a comprehensive range of ap-
plications from all facets of life

2. Clarity and brevity of presentation

3. Soundness of pedagogical approach

These elements are developed through the interplay of a variety of significant text features.

The feedback received from the users of the sixth edition of Introductory Statistics has
been very supportive and encouraging. Positive experiences reported by instructors and stu-
dents have served as evidence that this text offers an interesting and accessible approach to
statistics — the author's goal from the very first edition. The author has pursued the same goal
through the refinements and updates in this seventh edition, so that Introductory Statistics
can continue to provide a successful experience in statistics to a growing number of students
and instructors.

New to the Seventh Edition

The following are some of the changes made in the seventh edition:

■ A large number of the examples and exercises are new, providing new and varied ways for
students to practice statistical concepts.

■ Most of the case studies are new or revised, drawing on current uses of statistics in areas
of student interest.

■ New chapter opening images and questions incorporate real data in familiar situations.

■ New data are integrated throughout, reinforcing the vibrancy of statistics and the relevance
of statistics to student lives right now.

■ The Technology Instruction sections have been heavily revised to support the use of the lat-
est versions of the TI-84/84+, Minitab, and Excel.

■ Many of the Uses and Misuses sections are either new or have been updated.

■ Many of the Decide for Yourself sections are either new or have been updated.

■ Several new Miniprojects have been added.

■ A large number of new Technology Assignments have been added.

Style and Pedagogy

Hallmark Features of this Text

Clear and Concise Exposition The explanation of statistical methods and concepts is clear and
concise. Moreover, the style is user-friendly and easy to understand. In chapter introductions and
in transitions from section to section, new ideas are related to those discussed earlier.

Thorough Examples

Examples The text contains a wealth of examples, more than 200 in 15 chapters and Appendix
A. The examples are usually presented in a format showing a problem and its solution. They
are well sequenced and thorough, displaying all facets of concepts. Furthermore, the examples
capture students' interest because they cover a wide variety of relevant topics. They are based
on situations that practicing statisticians encounter every day. Finally, a large number of exam-
ples are based on real data taken from sources such as books, government and private data
sources and reports, magazines, newspapers, and professional journals.

Step-by-Step Solutions

Solutions A clear, concise solution follows each problem presented in an example. When the
solution to an example involves many steps, it is presented in a step-by-step format. For in-
stance, examples related to tests of hypothesis contain five steps that are consistently used to
solve such examples in all chapters. Thus, procedures are presented in the concrete settings of
applications rather than as isolated abstractions. Frequently, solutions contain highlighted re-
marks that recall and reinforce ideas critical to the solution of the problem. Such remarks add
to the clarity of presentation.

Enlightening Pedagogy

Margin Notes for Examples A margin note appears beside each example that briefly de-
scribes what is being done in that example. Students can use these margin notes to assist them
as they read through sections and to quickly locate appropriate model problems as they work
through exercises.

Frequent Use of Diagrams Concepts can often be made more understandable by describing
them visually with the help of diagrams. This text uses diagrams frequently to help students
understand concepts and solve problems. For example, tree diagrams are used extensively in
Chapters 4 and 5 to assist in explaining probability concepts and in computing probabilities.
Similarly, solutions to all examples about tests of hypothesis contain diagrams showing rejec-
tion regions, nonrejection regions, and critical values.

Highlighting Definitions of important terms, formulas, and key concepts are enclosed in
colored boxes so that students can easily locate them.

Cautions Certain items need special attention. These may deal with potential trouble spots
that commonly cause errors, or they may deal with ideas that students often overlook. Special
emphasis is placed on such items through the headings Remember, An Observation, or Warn-
ing. An icon is used to identify such items.

Realistic Applications

Case Studies Case studies, which appear in almost all chapters, provide additional illustra-
tions of the applications of statistics in research and statistical analysis. Most of these case stud-
ies are based on articles/snapshots published in journals, magazines, or newspapers. All case
studies are based on real data.

Abundant Exercises

Exercises and Supplementary Exercises The text contains an abundance of exercises
(excluding Technology Assignments), approximately 1500 in 15 chapters and Appendix A.
Moreover, a large number of these exercises contain several parts. Exercise sets appearing at the
end of each section (or sometimes at the end of two or three sections) include problems on the
topics of that section. These exercises are divided into two parts: Concepts and Procedures that
emphasize key ideas and techniques, and Applications that use these ideas and techniques in
concrete settings. Supplementary exercises appear at the end of each chapter and contain
exercises on all sections and topics discussed in that chapter. A large number of these exercises are
based on real data taken from varied data sources such as books, government and private data
sources and reports, magazines, newspapers, and professional journals. Not only do the
exercises given in the text provide practice for students, but the real data contained in the exercises
also provide interesting information and insight into economic, political, social, psychological, and
other aspects of life. The exercise sets also contain many problems that demand critical thinking
skills. The answers to selected odd-numbered exercises appear in the Answers section at the back
of the book. Optional exercises are indicated by an asterisk (*).

Advanced Exercises All chapters (except Chapters 1 and 14) have a set of exercises that are Challenging Problems
of greater difficulty. Such exercises appear under the heading Advanced Exercises as part of the
Supplementary Exercises.

Uses and Misuses This feature at the end of each chapter (before the Glossary) points out Misconceptions and Pitfalls

common misconceptions and pitfalls students will encounter in their study of statistics and in
everyday life. Subjects highlighted include such diverse topics as the use of the word average
and grading on a curve.

Decide for Yourself This feature appears at the end of each chapter (except Chapter 1) just Open-ended Problems

before the Technology Instruction section. In this section, a real-world problem is discussed,
and questions are raised about this problem that readers are required to answer.

Glossary (Summary and Review) Each chapter has a glossary that lists the key terms introduced in that chapter, along
with a brief explanation of each term. Almost all the terms that appear in boldface type in the
text are in the glossary.

Self-Review Tests (Testing Yourself) Each chapter contains a Self-Review Test, which appears immediately after
the Supplementary Exercises. These problems can help students test their grasp of the concepts
and skills presented in the respective chapters and monitor their understanding of statistical
methods. The problems marked by an asterisk (*) in the Self-Review Tests are optional. The answers
to almost all problems of the Self-Review Tests appear in the Answers section.

Formula Card (Key Formulas) A formula card that contains key formulas from all chapters and the normal
distribution and t distribution tables is included at the beginning of the book.

Technology Usage At the end of each chapter is a section covering uses of three major
technologies of statistics and probability: the TI-84, Minitab, and Excel. For each technology,
students are guided step by step through statistical analyses and shown how to enter, revise,
format, and save data in a spreadsheet, workbook, or named and unnamed lists, depending on the
technology used. Illustrations and screen shots demonstrate the use of these technologies.
Additional detailed technology instruction is provided in the technology manuals that are
online at www.wiley.com/college/mann.

Technology Assignments Each chapter contains a few technology assignments that appear at
the end of the chapter. These assignments can be completed using any statistical software.

Miniprojects Each chapter contains a few Miniprojects that appear just before the Decide for
Yourself sections. These Miniprojects either resemble very comprehensive exercises or ask students to
perform their own surveys and experiments. They provide practical applications of statistical
concepts to real life.

Data Sets A large number of data sets appear on the Web site for the text that is located
at www.wiley.com/college/mann. These data sets include the data for various exercises in
the text and eight large data sets. These eight large data sets are collected from various
sources and they contain information on several variables. Many exercises and assignments
in the text are based on these data sets. These large data sets can also be used for
instructor-driven analyses using a wide variety of statistical software packages as well as the
TI-84. These data sets are available on the Web site of the text in a variety of formats,
including Minitab,¹ Excel, and text formats.

¹Minitab is a registered trademark of Minitab, Inc., Quality Plaza, 1829 Pine Hall Road, State College, PA 16801-3008.
Phone: 814-238-3280.

Statistical Animations In relevant places throughout the text, an icon alerts students to the
availability of a statistical animation. These animations illustrate statistical concepts in the text
and can be found on the companion Web site.

GAISE Report Recommendations Adopted

In 2003, the American Statistical Association (ASA) funded the Guidelines for Assessment and
Instruction in Statistics Education (GAISE) Project to develop ASA-endorsed guidelines for as-
sessment and instruction in statistics for the introductory college statistics course. The report,
which can be found at www.amstat.org/education/gaise, resulted in the following series of rec-
ommendations for the first course in statistics and data analysis.

1. Emphasize statistical literacy and develop statistical thinking.

2. Use real data.

3. Stress conceptual understanding rather than mere knowledge of procedures.

4. Foster active learning in the classroom.

5. Use technology for developing concepts and analyzing data.

6. Use assessments to improve and evaluate student learning.

Here are a few examples of how this Introductory Statistics text can help you, the
instructor, meet the GAISE recommendations.

1. Many of the newer exercises require interpretation, not just a number. Graphical and nu-
meric summaries are combined in some new exercises in order to emphasize looking at the
whole picture, as opposed to using just one graph or one summary statistic.

2. The Decide for Yourself and Uses and Misuses features help to develop statistical thinking
and conceptual understanding.

3. All of the data sets in the exercises and in Appendix B are available on the book's Web site.
They have been formatted for a variety of statistical software packages. This eliminates the
need to enter data into the software. A variety of software instruction manuals also allows the
instructor to spend more time on concepts, and less time teaching how to use technology.

4. The Miniprojects help students to generate their own data by performing an experiment
and/or taking random samples from the large datasets given in Appendix B.

We highly recommend that all statistics instructors take the time to read the GAISE report. There
is a wealth of information in this report that can be used by everyone.

Web Site

http://www.wiley.com/college/mann

The Web site for this text provides additional resources for instructors and students. The
following items are available on this Web site:

• Formula Card

• Statistical Animations

• Computerized Test Bank

• Instructor's Solutions Manual

• PowerPoint Slides

• Data Sets (see Appendix B for a complete list of these data sets)

• Technology Resource Manuals.

• TI Graphing Calculator Manual

• Minitab Manual

• Excel Manual

These manuals provide step-by-step instructions, screen captures, and examples for using tech-
nology in the introductory statistics course. Also provided are exercise tables and indications of
which exercises from the text best lend themselves to the use of the package presented.

• Chapter 14: Multiple Regression

• Chapter 15: Nonparametric Methods

Using WileyPLUS

This online teaching and learning environment integrates the entire digital textbook with
the most effective instructor and student resources to fit every learning style. With WileyPLUS:

• Students achieve concept mastery in a rich, structured environment that is available 24/7.

• Instructors personalize and manage their course more effectively with assessment, assign-
ments, grade tracking, and more.

WileyPLUS can complement the current textbook or replace the printed text altogether.

For Students

Personalize the learning experience:

Different learning styles, different levels of proficiency, different levels of preparation — each
of your students is unique. WileyPLUS empowers them to take advantage of their individual
strengths:

• Students receive timely access to resources that address their demonstrated needs and get
immediate feedback and remediation when needed.

• Integrated, multimedia resources provide multiple study paths to fit each student's learning
preferences and encourage more-active learning.

• WileyPLUS includes many opportunities for self-assessment linked to the relevant portions
of the text. Students can take control of their learning and practice until they master the
material.

For Instructors

Personalize the teaching experience:

WileyPLUS empowers you with the tools and resources you need to make your teaching even
more effective:

• You can customize your classroom presentation with a wealth of resources and functional-
ity, from PowerPoint slides to a database of rich visuals. You can even add your own mate-
rials to your WileyPLUS course.

• With WileyPLUS you can identify those students who are falling behind and intervene ac-
cordingly, without having to wait for them to come to your office hours.

• WileyPLUS simplifies and automates such tasks as student performance assessment, making
assignments, scoring student work, keeping grades, and more.

Supplements

The following supplements are available to accompany this text:

■ Instructor's Solutions Manual (ISBN 978-0-470-57241-2) This manual contains complete
solutions to all of the exercises in the text.

■ Printed Test Bank (ISBN 978-0-470-57242-9) The printed copy of the test bank con-
tains a large number of multiple-choice questions, essay questions, and quantitative prob-
lems for each chapter.

■ Computerized Test Bank All of the questions in the Printed Test Bank are available elec-
tronically and can be obtained from the publisher.

■ Student Solutions Manual (ISBN 978-0-470-57239-9) This manual contains complete
solutions to all of the odd-numbered exercises in the text.

■ Student Study Guide (ISBN 978-0-470-57240-5) This guide contains review material
for a first course in statistics. Special attention is given to the critical material for each chap-
ter. Reviews of mathematical notation and formulas are also included.

Acknowledgments

I thank the following reviewers of this and/or previous editions of this book, whose comments
and suggestions were invaluable in improving the text.

Alfred A. Akinsete, Marshall University
Scott S. Albert, College of DuPage
Michael R. Allen, Tennessee Technological University
Peter Arvanites, Rockland Community College
K. S. Asal, Broward Community College
Louise Audette, Manchester Community College
Nicole Betsinger, Arapahoe Community College
Joan Bookbinder, Johnson & Wales University
Dean Burbank, Gulf Coast Community College
Helen Burn, Highline Community College
Gerald Busald, San Antonio College
Peter A. Carlson, Delta College
Jayanta Chandra, University of Notre Dame
C. K. Chauhan, Indiana-Purdue University at Fort Wayne
James Curl, Modesto Community College
Gregory Daubenmire, Las Positas Community College
Joe DeMaio, Kennesaw State University
Fred H. Dorner, Trinity University, San Antonio
William D. Ergle, Roanoke College, Salem, Virginia
Ruby Evans, Santa Fe Community College
Ronald Ferguson, San Antonio College
James C. Ford
Anda Gadidov, Kennesaw State University
Frank Goulard, Portland Community College
Robert Graham, Jacksonville State University, Jacksonville, Alabama
Larry Griffey, Florida Community College, Jacksonville
Arjun K. Gupta, Bowling Green State University
David Gurney, Southeastern Louisiana University
Daesung Ha, Marshall University
A. Eugene Hileman, Northeastern State University, Tahlequah, Oklahoma
John G. Horner, Cabrillo College
Virginia Horner, Diablo Valley College
Ina Parks S. Howell, Florida International University
John Haussermann, Monterey Peninsular College
Shana Irwin, University of North Texas
Gary S. Itzkowitz, Rowan State College
Joanna Jeneralczuk, University of Massachusetts, Amherst
Jean Johnson, Governors State University
Michael Karelius, American River College, Sacramento

Dix J. Kelly, Central Connecticut State University
Jong Sung Kim, Portland State University
Linda Kohl, University of Michigan, Ann Arbor
Martin Kotler, Pace University, Pleasantville, New York
Marlene Kovaly, Florida Community College, Jacksonville
Hillel Kumin, University of Oklahoma
Carlos de la Lama, San Diego City College
Rita Lindsay, Indian River State College
Gaurab Mahapatra, University of Akron
Richard McGowan, University of Scranton
Daniel S. Miller, Central Connecticut State University
Dorothy Miners, Brock University
Satya N. Mishra, University of South Alabama
Jeffrey Mock, Diablo Valley College
Luis Moreno, Broome Community College, Binghamton
Robert A. Nagy, University of Wisconsin, Green Bay
Sharon Navard, The College of New Jersey
Nhu T. Nguyen, New Mexico State University
Paul T. Nkansah, Florida Agricultural and Mechanical University
Joyce Oster, Johnson and Wales University
Lindsay Packer, College of Charleston
Mary Parker, Austin Community College
Roger Peck, University of Rhode Island, Kingston
Chester Piascik, Bryant College, Smithfield
Joseph Pigeon, Villanova University
Cristina Popescue, Grant MacEwan College
Aaron Robertson, Colgate University
Gerald Rogers, New Mexico State University, Las Cruces
Emily Ross, University of Missouri, St. Louis
Juana Sanchez, UCLA
Brunilda Santiago, Indian River State College
Phillis Schumacher, Bryant College, Smithfield
Kathryn Schwartz, Scottsdale Community College
Ronald Schwartz, Wilkes University, Wilkes-Barre
David Stark, University of Akron
Larry Stephens, University of Nebraska, Omaha
Bruce Trumbo, California State University, Hayward
Vasant Waikar, Miami University
Jean Weber, University of Arizona, Tucson
Terry Wilson, San Jacinto College, Pasadena
James Wright, Bucknell University
K. Paul Yoon, Fairleigh Dickinson University, Madison
Zhiyi Zhang, University of North Carolina

I express my thanks to the following for their contributions to earlier editions of this book
that made it better in many ways: Maryanne Clifford (Eastern Connecticut State University),
Gerald Geissert, Daniel S. Miller (Central Connecticut State University), and David Santana-
Ortiz (Rand Organization).

I extend my special thanks to Christopher Lacke of Rowan University, who contributed to
this edition in many significant ways. Without his help, this book would not be in this form. I take
this opportunity to thank Ann Ostberg for preparing the answers section for the back of the book
and for checking the text examples for math accuracy, and Sandra Zirkes for checking the
answers section for accuracy. In addition, I thank Eastern Connecticut State University for all the
support I received.

It is of utmost importance that a textbook be accompanied by complete and accurate
supplements. I take pride in mentioning that the supplements prepared for this text possess
these qualities and much more. I thank the authors of all these supplements.

It is my pleasure to thank all the professionals at John Wiley with whom I enjoyed work-
ing during this revision. Among them are Laurie Rosatone (Vice President and Executive
Publisher), Jackie Henry (Full Service Manager), Jeof Vita (Art Director), Sarah Wilkin
(Photo Associate), Dorothy Sinclair (Production Manager), Valerie Vargas (Senior Production
Editor), Ellen Keohane (Editor), Beth Pearson (Editorial Assistant), Melissa Edwards (Media
Editor), Ari Wolfe (Media Project Editor), Sarah Davis (Marketing Manager), Harry
Nolan (Creative Director), and Anna Melhorn (Senior Illustration Editor). Last but most
importantly, I extend my most heartfelt thanks to Jenn Albanese (Project Editor), whose support
and guidance were of immense help during this revision.

Any suggestions from readers for future revisions would be greatly appreciated. Such
suggestions can be sent to the author at mann@easternct.edu or premmann@yahoo.com.

Prem S. Mann
Willimantic, CT
November 2009

CONTENTS

Chapter 1 Introduction

1.1 What Is Statistics?

1.2 Types of Statistics

Case Study 1-1 2008 U.S. Patent Leaders

Case Study 1-2 TV Commercials and Holiday Shopping

1.3 Population Versus Sample

Case Study 1-3 On Road, It's "Do As I Say, Not As I Do"

1.4 Basic Terms

1.5 Types of Variables

1.6 Cross-Section Versus Time-Series Data

1.7 Sources of Data

1.8 Summation Notation

Uses and Misuses/Glossary/Supplementary Exercises/Self-Review Test/Mini-Project/
Technology Instruction/Technology Assignments

Chapter 2 Organizing and Graphing Data

2.1 Raw Data

2.2 Organizing and Graphing Qualitative Data
Case Study 2-1 Career Choices for High School Students
Case Study 2-2 In or Out in 30 Minutes

2.3 Organizing and Graphing Quantitative Data
Case Study 2-3 Morning Grooming

2.4 Shapes of Histograms

2.5 Cumulative Frequency Distributions

2.6 Stem-and-Leaf Displays

2.7 Dotplots

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/
Mini-Projects/Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 3 Numerical Descriptive Measures

3.1 Measures of Central Tendency for Ungrouped Data
Case Study 3-1 Average Attendance at Baseball Games
Case Study 3-2 The Gender Pay Gap

3.2 Measures of Dispersion for Ungrouped Data

3.3 Mean, Variance, and Standard Deviation for Grouped Data

3.4 Use of Standard Deviation
Case Study 3-3 Here Comes the SD

3.5 Measures of Position

3.6 Box-and-Whisker Plot

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Appendix 3.1/Self-Review Test/
Mini-Projects/Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 4 Probability

4.1 Experiment, Outcomes, and Sample Space

4.2 Calculating Probability

4.3 Counting Rule

4.4 Marginal and Conditional Probabilities
Case Study 4-1 Rolling Stops

4.5 Mutually Exclusive Events

4.6 Independent Versus Dependent Events

4.7 Complementary Events

4.8 Intersection of Events and the Multiplication Rule

Case Study 4-2 Baseball Players have "Slumps" and "Streaks"

4.9 Union of Events and the Addition Rule

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 5 Discrete Random Variables and Their Probability
Distributions

5.1 Random Variables

5.2 Probability Distribution of a Discrete Random Variable

5.3 Mean of a Discrete Random Variable

5.4 Standard Deviation of a Discrete Random Variable
Case Study 5-1 Aces High Instant Lottery Game-20th Edition

5.5 Factorials, Combinations, and Permutations
Case Study 5-2 Playing Lotto

5.6 The Binomial Probability Distribution

5.7 The Hypergeometric Probability Distribution

5.8 The Poisson Probability Distribution
Case Study 5-3 Ask Mr. Statistics

Case Study 5-4 Living and Dying in the USA

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 6 Continuous Random Variables and the Normal Distribution

6.1 Continuous Probability Distribution

6.2 The Normal Distribution

Case Study 6-1 Distribution of Time Taken to Run a Road Race

6.3 The Standard Normal Distribution

6.4 Standardizing a Normal Distribution

6.5 Applications of the Normal Distribution

6.6 Determining the z and x Values When an Area Under the Normal Distribution Curve
Is Known

6.7 The Normal Approximation to the Binomial Distribution

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 7 Sampling Distributions

7.1 Population and Sampling Distributions

7.2 Sampling and Nonsampling Errors

7.3 Mean and Standard Deviation of x̄

7.4 Shape of the Sampling Distribution of x̄

7.5 Applications of the Sampling Distribution of x̄

7.6 Population and Sample Proportions

7.7 Mean, Standard Deviation, and Shape of the Sampling Distribution of p̂

7.8 Applications of the Sampling Distribution of p̂

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 8 Estimation of the Mean and Proportion

8.1 Estimation: An Introduction

8.2 Point and Interval Estimates

8.3 Estimation of a Population Mean: σ Known

Case Study 8-1 Raising a Child

8.4 Estimation of a Population Mean: σ Not Known

8.5 Estimation of a Population Proportion: Large Samples

Case Study 8-2 Which Sound Is the Most Frustrating to Hear?

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 9 Hypothesis Tests About the Mean and Proportion

9.1 Hypothesis Tests: An Introduction

9.2 Hypothesis Tests About μ: σ Known

Case Study 9-1 How Crashes Affect Auto Premiums

9.3 Hypothesis Tests About μ: σ Not Known

9.4 Hypothesis Tests About a Population Proportion: Large Samples

Case Study 9-2 Favorite Seat in the Plane

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 10 Estimation and Hypothesis Testing: Two Populations

10.1 Inferences About the Difference Between Two Population Means for Independent Samples:
σ₁ and σ₂ Known

10.2 Inferences About the Difference Between Two Population Means for Independent Samples:
σ₁ and σ₂ Unknown but Equal

Case Study 10-1 Average Compensation for Accountants

10.3 Inferences About the Difference Between Two Population Means for Independent Samples:
σ₁ and σ₂ Unknown and Unequal

10.4 Inferences About the Difference Between Two Population Means for Paired Samples

10.5 Inferences About the Difference Between Two Population Proportions for Large and Independent
Samples

Case Study 10-2 Is Vacation Important?

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 11 Chi-Square Tests

11.1 The Chi-Square Distribution

11.2 A Goodness-of-Fit Test

Case Study 11-1 What Is Your Favorite Season?

11.3 Contingency Tables

11.4 A Test of Independence or Homogeneity

11.5 Inferences About the Population Variance

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 12 Analysis of Variance

12.1 The F Distribution

12.2 One-Way Analysis of Variance

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 13 Simple Linear Regression

13.1 Simple Linear Regression Model

13.2 Simple Linear Regression Analysis

Case Study 13-1 Regression of Heights and Weights of NBA Players

13.3 Standard Deviation of Random Errors

13.4 Coefficient of Determination

13.5 Inferences About B

13.6 Linear Correlation

13.7 Regression Analysis: A Complete Example

13.8 Using the Regression Model

13.9 Cautions in Using Regression

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Chapter 14 Multiple Regression

This chapter is not included in this text but is available for download on the Web site at www.wiley.com/college/mann.

14.1 Multiple Regression Analysis

14.2 Assumptions of the Multiple Regression Model

14.3 Standard Deviation of Errors

14.4 Coefficient of Multiple Determination

14.5 Computer Solutions of Multiple Regression

Uses and Misuses/Glossary/Self-Review Test/Mini-Projects/Decide for Yourself

Chapter 15 Nonparametric Methods

This chapter is not included in this text but is available for download on the Web site at www.wiley.com/college/mann.

15.1 The Sign Test

15.2 The Wilcoxon Signed-Rank Test for Two Dependent Samples

15.3 The Wilcoxon Rank Sum Test for Two Independent Samples

15.4 The Kruskal-Wallis Test

15.5 The Spearman Rho Rank Correlation Coefficient Test

15.6 The Runs Test for Randomness

Uses and Misuses/Glossary/Supplementary Exercises/Advanced Exercises/Self-Review Test/Mini-Projects/
Decide for Yourself/Technology Instruction/Technology Assignments

Appendix A Sample Surveys, Sampling Techniques, and Design of Experiments

A.1 Sources of Data

A.1.1 Internal Sources

A.1.2 External Sources

A.1.3 Surveys and Experiments
Case Study A-1 Is It a Simple Question?
A.2 Sample Surveys and Sampling Techniques

A.2.1 Why Sample?
A.2.2 Random and Nonrandom Samples
A.2.3 Sampling and Nonsampling Errors
A.2.4 Random Sampling Techniques

A.3 Design of Experiments

Case Study A-2 Do Antibacterial Soaps Work?

Exercises / Advanced Exercises / Glossary

Appendix B Explanation of Data Sets

Data Set I: City Data
Data Set II: Data on States

Data Set III: NBA Data

Data Set IV: Manchester (Connecticut) Road Race Data

Data Set V: Sample of 500 Observations Selected from Data Set IV

Data Set VI: Data on Movies

Data Set VII: Standard & Poor's 100 Index Data

Data Set VIII: McDonald's Data

Appendix C Statistical Tables

I Table of Binomial Probabilities

II Values of e^(-λ)

III Table of Poisson Probabilities

IV Standard Normal Distribution Table

V The t Distribution Table

VI Chi-Square Distribution Table

VII The F Distribution Table

Tables VIII through XII along with Chapters 14 and 15 are available on the Web site of this text.

VIII Critical Values of X for the Sign Test

IX Critical Values of T for the Wilcoxon Signed-Rank Test

X Critical Values of T for the Wilcoxon Rank Sum Test

XI Critical Values for the Spearman Rho Rank Correlation Coefficient Test

XII Critical Values for a Two-Tailed Runs Test with α = .05

Answers to Selected Odd-Numbered Exercises and Self-Review Tests
Photo Credits
Index

Seventh Edition

INTRODUCTORY STATISTICS

Chapter 1 Introduction

Do you feel compelled to shop at specific stores during the holidays? If yes, do you know why?
Do you think TV commercials have anything to do with it? In a survey conducted by the National
Retail Federation, only 18% of adults said that TV commercials influence them to shop at specific
stores. Do you find this value to be surprising? (See Case Study 1-2).

The study of statistics has become more popular than ever over the past four decades or so. The in-
creasing availability of computers and statistical software packages has enlarged the role of statistics
as a tool for empirical research. As a result, statistics is used for research in almost all professions,
from medicine to sports. Today, college students in almost all disciplines are required to take at least
one statistics course. Almost all newspapers and magazines these days contain graphs and stories on
statistical studies. After you finish reading this book, it should be much easier to understand these
graphs and stories.

Every field of study has its own terminology. Statistics is no exception. This introductory chapter
explains the basic terms of statistics. These terms will bridge our understanding of the concepts and
techniques presented in subsequent chapters.

1.1 What Is Statistics?

1.2 Types of Statistics

Case Study 1-1 2008 U.S. Patent Leaders

Case Study 1-2 TV Commercials and Holiday Shopping

1.3 Population Versus Sample

Case Study 1-3 On Road, It's "Do as I Say, Not as I Do"

1.4 Basic Terms

1.5 Types of Variables

1.6 Cross-Section Versus Time-Series Data

1.7 Sources of Data

1.8 Summation Notation

1.1 What Is Statistics?

The word statistics has two meanings. In the more common usage, statistics refers to numeri-
cal facts. The numbers that represent the income of a family, the age of a student, the per-
centage of passes completed by the quarterback of a football team, and the starting salary of a
typical college graduate are examples of statistics in this sense of the word. A 1988 article in
U.S. News & World Report declared "Statistics are an American obsession."¹ During the 1988
baseball World Series between the Los Angeles Dodgers and the Oakland A's, the then NBC
commentator Joe Garagiola reported to the viewers numerical facts about the players' per-
formances. In response, fellow commentator Vin Scully said, "I love it when you talk statis-
tics." In these examples, the word statistics refers to numbers.
The following examples present some statistics:

1. During the 43rd Super Bowl on February 1, 2009, NBC charged $3 million for a 30-second
commercial.

2. New York City mayor, Michael Bloomberg, gave $239 million to charity in 2008.

3. According to the Chronicle of Philanthropy, Wal-Mart gave $337.9 million to charity in
2007.

4. According to the U.S. Department of Agriculture, about 900 million roses were imported
from Colombia to the United States in 2005.

5. According to a 2008 SHRM Employee Benefit Survey, 3% of large companies allow pets
at work.

6. According to the Centers for Disease Control and Prevention, flu costs the United States
about $87 billion a year in terms of direct medical costs, loss of life, and reduced quality
of life.

The second meaning of statistics refers to the field or discipline of study. In this sense of
the word, statistics is defined as follows.

Definition

Statistics Statistics is a group of methods used to collect, analyze, present, and interpret data
and to make decisions.

Every day we make decisions that may be personal, business related, or of some other kind.
Usually these decisions are made under conditions of uncertainty. Many times, the situations or
problems we face in the real world have no precise or definite solution. Statistical methods help
us make scientific and intelligent decisions in such situations. Decisions made by using statis-
tical methods are called educated guesses. Decisions made without using statistical (or scien-
tific) methods are pure guesses and, hence, may prove to be unreliable. For example, opening
a large store in an area with or without assessing the need for it may affect its success.

Like almost all fields of study, statistics has two aspects: theoretical and applied. Theoreti-
cal or mathematical statistics deals with the development, derivation, and proof of statistical
theorems, formulas, rules, and laws. Applied statistics involves the applications of those theo-
rems, formulas, rules, and laws to solve real-world problems. This text is concerned with ap-
plied statistics and not with theoretical statistics. By the time you finish studying this book, you
will have learned how to think statistically and how to make educated guesses.

1.2 Types of Statistics

Broadly speaking, applied statistics can be divided into two areas: descriptive statistics and inf-
erential statistics.

¹"The Numbers Racket: How Polls and Statistics Lie," U.S. News & World Report, July 11, 1988, pp. 44-47.

Case Study 1-1 2008 U.S. Patent Leaders

[Chart: USA TODAY Snapshots, "2008 U.S. patent leaders": top five companies with most patents earned in 2008. Chart note 1: first company to break 4,000 issuances in one year. Source: IFI Patent Intelligence. By Jae Yang and Sam Ward, USA TODAY.]

The accompanying chart shows the top five companies in the United States with the most patents. This
chart describes the data on patents as collected from these five companies and, hence, is an example of
descriptive statistics.

Source: USA TODAY, January 21, ... Chart reproduced with permission.

1.2.1 Descriptive Statistics

Suppose we have information on the test scores of students enrolled in a statistics class. In sta-
tistical terminology, the whole set of numbers that represents the scores of students is called a
data set, the name of each student is called an element, and the score of each student is called
an observation. (These terms are defined in more detail in Section 1.4.)

A data set in its original form is usually very large. Consequently, such a data set is not
very helpful in drawing conclusions or making decisions. It is easier to draw conclusions from
summary tables and diagrams than from the original version of a data set. So, we reduce data
to a manageable size by constructing tables, drawing graphs, or calculating summary measures
such as averages. The portion of statistics that helps us do this type of statistical analysis is
called descriptive statistics.

Definition

Descriptive Statistics Descriptive statistics consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.

Both Chapters 2 and 3 discuss descriptive statistical methods. In Chapter 2, we learn how
to construct tables and how to graph data. In Chapter 3, we learn to calculate numerical sum-
mary measures, such as averages.
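
As a rough illustration of descriptive statistics, the following Python sketch reduces a small data set of test scores to a few summary measures. The scores and the names used in the code are hypothetical and are not taken from the text.

```python
from statistics import mean, median

# Hypothetical test scores for a small statistics class (not from the text).
scores = [72, 85, 91, 66, 78, 85, 94, 70]

# Summary measures of the kind introduced in Chapter 3.
summary = {
    "number of observations": len(scores),
    "lowest score": min(scores),
    "highest score": max(scores),
    "mean": mean(scores),
    "median": median(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A handful of summary values like these is usually easier to interpret than the full list of scores, which is the point made in the paragraphs above.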

Case Study 1-1 presents an example of descriptive statistics.

1.2.2 Inferential Statistics

In statistics, the collection of all elements of interest is called a population. The selection of a
few elements from this population is called a sample. (Population and sample are discussed in
more detail in Section 1.3.)

A major portion of statistics deals with making decisions, inferences, predictions, and fore-
casts about populations based on results obtained from samples. For example, we may make some
decisions about the political views of all college and university students based on the political
views of 1000 students selected from a few colleges and universities. As another example, we may
want to find the starting salary of a typical college graduate. To do so, we may select 2000 recent
college graduates, find their starting salaries, and make a decision based on this information. The
area of statistics that deals with such decision-making procedures is referred to as inferential
statistics. This branch of statistics is also called inductive reasoning or inductive statistics.

TV COMMERCIALS AND HOLIDAY SHOPPING

USA TODAY Snapshots chart: Did TV commercials motivate you to shop at a particular retailer?

Yes: 18%
No impact, I regularly shop there: 30%
Not influenced at all: 44%

Source: National Retail Federation survey of adults 18 and older. Margin of error ±1 percentage point.
By Jae Yang and Veronica Salazar, USA TODAY

The accompanying chart shows the degree to which TV commercials motivate people to shop at specific
retailers. According to this survey conducted by the National Retail Federation, 18% of adults included in
the survey said that they are influenced by such commercials, 30% said that they are not influenced because
they regularly shop at those stores, and 44% said that they are not influenced at all. The chart indicates
that there is a margin of error of ±1 percentage point. We will discuss the concept of margin of error in
Chapter 8. As a brief explanation, the margin of error means that the percentages given in the chart can
change in the plus or minus direction by 1 percentage point when applied to the population.

Definition

Inferential Statistics Inferential statistics consists of methods that use sample results to help
make decisions or predictions about a population.
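
The idea of using sample results to say something about a population can be sketched in Python. This is only an illustration under stated assumptions: the population of starting salaries is simulated, the salary range and the seed are arbitrary choices, and in a real study the population mean would be unknown.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: starting salaries (in dollars) of 50,000 graduates.
population = [random.randint(30_000, 70_000) for _ in range(50_000)]

# Select 2000 graduates without replacement and use their mean as an estimate.
sample = random.sample(population, 2000)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean (estimate): {sample_mean:.2f}")
```

The sample mean should land close to the population mean even though only 2000 of the 50,000 salaries were examined. How close is a question taken up by the inferential methods of later chapters.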

Case Study 1-2 presents an example of inferential statistics. It shows the results of a sur-
vey in which people were asked whether or not TV commercials motivate them to shop at
specific retailers.

Chapters 8 through 13 and parts of Chapter 7 deal with inferential statistics.

Probability, which gives a measurement of the likelihood that a certain outcome will occur,
acts as a link between descriptive and inferential statistics. Probability is used to make statements
about the occurrence or nonoccurrence of an event under uncertain conditions. Probability and prob-
ability distributions are discussed in Chapters 4 through 6 and parts of Chapter 7.

EXERCISES

CONCEPTS AND PROCEDURES

1.1 Briefly describe the two meanings of the word statistics.

1.2 Briefly explain the types of statistics.

1.3 Population Versus Sample

We will encounter the terms population and sample on almost every page of this text. 2 Conse-
quently, understanding the meaning of each of these two terms and the difference between them
is crucial.

Suppose a statistician is interested in knowing

1. The percentage of all voters in a city who will vote for a particular candidate in an election

2. The 2009 gross sales of all companies in New York City

3. The prices of all houses in California

In these examples, the statistician is interested in all voters, all companies, and all houses.
Each of these groups is called the population for the respective example. In statistics, a popu-
lation does not necessarily mean a collection of people. It can, in fact, be a collection of people
or of any kind of item such as houses, books, television sets, or cars. The population of inter-
est is usually called the target population.

Definition

Population or Target Population A population consists of all elements — individuals, items, or
objects — whose characteristics are being studied. The population that is being studied is also
called the target population.

Most of the time, decisions are made based on portions of populations. For example, the
election polls conducted in the United States to estimate the percentages of voters who favor
various candidates in any presidential election are based on only a few hundred or a few thou-
sand voters selected from across the country. In this case, the population consists of all regis-
tered voters in the United States. The sample is made up of a few hundred or few thousand vot-
ers who are included in an opinion poll. Thus, the collection of a few elements selected from a
population is called a sample.

Definition

Sample A portion of the population selected for study is referred to as a sample.

Figure 1.1 illustrates the selection of a sample from a population.

Figure 1.1 Population and sample.

2 To learn more about sampling and sampling techniques, refer to Appendix A.

The collection of information from the elements of a population or a sample is called a
survey. A survey that includes every element of the target population is called a census. Often
the target population is very large. Hence, in practice, a census is rarely taken because it is
expensive and time-consuming. In many cases, it is even impossible to identify each element
of the target population. Usually, to conduct a survey, we select a sample and collect the re-
quired information from the elements included in that sample. We then make decisions based
on this sample information. Such a survey conducted on a sample is called a sample survey.
As an example, if we collect information on the 2009 incomes of all families in Connecticut,
it will be referred to as a census. On the other hand, if we collect information on the 2009
incomes of 50 families from Connecticut, it will be called a sample survey.

Definition

Census and Sample Survey A survey that includes every member of the population is called a
census. The technique of collecting information from a portion of the population is called a
sample survey.

Case Study 1-3 presents an example of a sample survey.

The purpose of conducting a sample survey is to make decisions about the corresponding
population. It is important that the results obtained from a sample survey closely match the re-
sults that we would obtain by conducting a census. Otherwise, any decision based on a sample
survey will not apply to the corresponding population. As an example, to find the average income
of families living in New York City by conducting a sample survey, the sample must contain
families who belong to different income groups in almost the same proportion as they exist in
the population. Such a sample is called a representative sample. Inferences derived from a rep-
resentative sample will be more reliable.

Definition

Representative Sample A sample that represents the characteristics of the population as closely
as possible is called a representative sample.
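
One informal way to see whether a sample is roughly representative on a single characteristic is to compare group proportions in the sample with those in the population. The sketch below does this in Python. The income groups, the counts, and the helper name group_shares are hypothetical and chosen only for illustration.

```python
from collections import Counter

def group_shares(groups):
    """Return the proportion of elements that fall in each group."""
    counts = Counter(groups)
    total = len(groups)
    return {group: count / total for group, count in counts.items()}

# Hypothetical income groups for a population of 1000 families and a sample of 100.
population = ["low"] * 300 + ["middle"] * 500 + ["high"] * 200
sample = ["low"] * 31 + ["middle"] * 49 + ["high"] * 20

pop_shares = group_shares(population)
sample_shares = group_shares(sample)

# Largest gap between a group's share in the sample and in the population.
largest_gap = max(abs(sample_shares[g] - pop_shares[g]) for g in pop_shares)

print(pop_shares)
print(sample_shares)
print(f"largest gap: {largest_gap:.2f}")
```

Here the income groups appear in the sample in almost the same proportions as in the population, which is the property described in the text.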

A sample may be random or nonrandom. In a random sample, each element of the pop-
ulation has a chance of being included in the sample. However, in a nonrandom sample this
may not be the case.

Definition

Random Sample A sample drawn in such a way that each element of the population has a
chance of being selected is called a random sample. If all samples of the same size selected from
a population have the same chance of being selected, we call it simple random sampling. Such
a sample is called a simple random sample.

One way to select a random sample is by lottery or draw. For example, if we are to select
5 students from a class of 50, we write each of the 50 names on a separate piece of paper. Then
we place all 50 slips in a box and mix them thoroughly. Finally, we randomly draw 5 slips from
the box. The 5 names drawn give a random sample. On the other hand, if we arrange all 50
names alphabetically and then select the first 5 names on the list, it is a nonrandom sample be-
cause the students listed 6th to 50th have no chance of being included in the sample.
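
The lottery procedure above can be imitated in Python. This is a minimal sketch: the class list is hypothetical, and random.sample from the standard library plays the role of drawing slips from the box.

```python
import random

# Hypothetical class list of 50 names, already in alphabetical order.
names = [f"Student {i:02d}" for i in range(1, 51)]

# Lottery: every name has a chance of being drawn.
random_sample = random.sample(names, 5)

# Nonrandom: the first 5 names on the list; names 6 to 50 can never be chosen.
nonrandom_sample = names[:5]

print(random_sample)
print(nonrandom_sample)
```

Running the sketch several times gives a different random sample each time, whereas the nonrandom sample never changes.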

ON THE ROAD, IT'S "DO AS I SAY, NOT AS I DO"

Survey results chart: Percentage of adults who said they had done these things in the previous 30 days.
Chart items: exceeded the speed limit by 15 mph on major highways; exceeded the speed limit by 15 mph
on neighborhood streets (15%); deliberately ran red lights (6%).

Source: AAA Foundation for Traffic Safety Oct. 25-Jan. 10 survey of 2,509 adults.
By Alejandro Gonzalez, USA TODAY

Most motorists in the USA (78%) call aggressive driving a serious concern, yet nearly half admit speeding
on major highways in the past 30 days, according to a survey and analysis of research made public today
by the AAA Foundation for Traffic Safety.

Drivers also confessed to recently speeding on residential streets, speeding up to beat yellow lights,
honking at other drivers and tailgating, a AAA Foundation survey of 2509 adults found.

The report reflects what the Washington, D.C.-based, not-for-profit arm of the automobile club calls
American drivers' "do as I say, not as I do" attitude.

Changing that attitude is the first step toward making roads safe from aggressive driving, says Peter
Kissinger, president and CEO of the foundation, who notes that traffic crashes kill someone every 13 minutes.
Aggressive driving is a factor in 56% of fatal crashes, according to a Foundation study of federal data of
fatal crashes from 2003 through 2007, the most recent data available.

"We count (traffic deaths) in ones, twos and threes, as opposed to a plane falling out of the sky,
which gets major attention," Kissinger says. "But this is a major public health crisis."

Aggressive driving, the National Highway Traffic Safety Administration says, occurs when "an individual
commits a combination of moving traffic offenses so as to endanger other persons or property." These
often include speeding, tailgating, improper lane changes, failing to yield the right of way, improper
passing, and running red lights.

In the past 10 years, 14 states have taken steps against aggressive driving, according to the Governors
Highway Safety Association.

"Traffic congestion is generally the No. 1 cause for aggressive driving," says Thomas Gianni, deputy
chief of the Maryland Highway Safety Office and regional coordinator of a crackdown on aggressive driving
by 150 police agencies in Maryland, Virginia, and Washington, D.C. "People are trying to get to too
many places and they don't allow themselves the time to get where they need to be."

Amanda Cooke, 21, a computer teacher in Running Springs, Calif., says she used to drive so aggressively
that her boyfriend was afraid to ride with her. "I'd cut people off to get into the lane I wanted to
get in," she says. "I'd tailgate them if they were going too slow or blink my lights if it was night." Cooke
says she stopped driving that way after crashing into another driver. "I didn't think it was as risky as it
was," she says.

Source: Larry Copeland, USA TODAY, April 21, 2009. Reproduced with permission.

A sample may be selected with or without replacement. In sampling with replacement,
each time we select an element from the population, we put it back in the population before we
select the next element. Thus, in sampling with replacement, the population contains the same
number of items each time a selection is made. As a result, we may select the same item more
than once in such a sample. Consider a box that contains 25 marbles of different colors.
Suppose we draw a marble, record its color, and put it back in the box before drawing the next
marble. Every time we draw a marble from this box, the box contains 25 marbles. This is an
example of sampling with replacement. The experiment of rolling a die many times is another
example of sampling with replacement because every roll has the same six possible outcomes.

Sampling without replacement occurs when the selected element is not replaced in the
population. In this case, each time we select an item, the size of the population is reduced by
one element. Thus, we cannot select the same item more than once in this type of sampling.
Most of the time, samples taken in statistics are without replacement. Consider an opinion poll
based on a certain number of voters selected from the population of all eligible voters. In this
case, the same voter is not selected more than once. Therefore, this is an example of sampling
without replacement.
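
The difference between the two sampling schemes can be sketched with Python's standard random module, where random.choices draws with replacement and random.sample draws without replacement. The marble labels and the sample size of 10 are assumptions made for illustration.

```python
import random

# Hypothetical box of 25 marbles, labeled so that each one is distinguishable.
marbles = [f"marble {i}" for i in range(1, 26)]

# With replacement: the box holds 25 marbles before every draw,
# so the same marble may appear more than once.
with_replacement = random.choices(marbles, k=10)

# Without replacement: each draw removes a marble, so no marble repeats.
without_replacement = random.sample(marbles, k=10)

print(with_replacement)
print(without_replacement)
```

Repeated marbles can show up only in the first list. The second list always contains 10 distinct marbles.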

EXERCISES

CONCEPTS AND PROCEDURES

1.3 Briefly explain the terms population, sample, representative sample, random sample, sampling with
replacement, and sampling without replacement.

1.4 Give one example each of sampling with and sampling without replacement.

1.5 Briefly explain the difference between a census and a sample survey. Why is conducting a sample
survey preferable to conducting a census?

■ APPLICATIONS

1.6 Explain whether each of the following constitutes a population or a sample.

a. Pounds of bass caught by all participants in a bass fishing derby

b. Credit card debts of 100 families selected from a city

c. Number of home runs hit by all Major League baseball players in the 2009 season

d. Number of parole violations by all 2147 parolees in a city

e. Amount spent on prescription drugs by 200 senior citizens in a large city

1.7 Explain whether each of the following constitutes a population or a sample.

a. Number of personal fouls committed by all NBA players during the 2008-2009 season

b. Yield of potatoes per acre for 10 pieces of land

c. Weekly salaries of all employees of a company

d. Cattle owned by 100 farmers in Iowa

e. Number of computers sold during the past week at all computer stores in Los Angeles

1.4 Basic Terms

It is very important to understand the meaning of some basic terms that will be used frequently
in this text. This section explains the meaning of an element (or member), a variable, an
observation, and a data set. An element and a data set were briefly defined in Section 1.2. This
section defines these terms formally and illustrates them with the help of an example.

Table 1.1 gives information on the 2007 charitable givings (in millions of U.S. dollars) by
six retail companies. We can call this group of companies a sample of six companies. Each
company listed in this table is called an element or a member of the sample. Table 1.1 con-
tains information on six elements. Note that elements are also called observational units.

Definition

Element or Member An element or member of a sample or population is a specific subject or object
(for example, a person, firm, item, state, or country) about which the information is collected.

Table 1.1 Charitable Givings of Six Retailers in 2007

Company        2007 Charitable Givings
               (millions of dollars)
Home Depot
Macy's         35.2
Wal-Mart       337.9
Best Buy       31.8
Target         168.9
Lowe's         27.5

Source: The Chronicle of Philanthropy.

Annotations in the table: the column heading (2007 charitable givings) is marked as the variable; Wal-Mart is marked as an element or a member; and its value, 337.9, is marked as an observation or measurement.

The 2007 charitable givings in our example is called a variable. The 2007 charitable givings
is a characteristic of companies that we are investigating or studying.

Definition

Variable A variable is a characteristic under study that assumes different values for different
elements. In contrast to a variable, the value of a constant is fixed.

Other examples of variables are the incomes of households, the number of houses built in
a city per month during the past year, the makes of cars owned by people, the gross profits of
companies, and the number of insurance policies sold by a salesperson per day during the past
month.

In general, a variable assumes different values for different elements, as does the 2007 char-
itable givings of the six companies in Table 1.1. For some elements in a data set, however, the
values of the variable may be the same. For example, if we collect information on incomes of
households, these households are expected to have different incomes, although some of them
may have the same income.

A variable is often denoted by x, y, or z. For instance, in Table 1.1, the 2007 charitable giv-
ings of companies may be denoted by any one of these letters. Starting with Section 1.8, we
will begin to use these letters to denote variables.

Each of the values representing the 2007 charitable givings of the six companies in Table 1.1
is called an observation or measurement.

Definition

Observation or Measurement The value of a variable for an element is called an observation or
measurement.

From Table 1.1, the 2007 charitable givings of Wal-Mart were $337.9 million. The value
$337.9 million is an observation or a measurement. Table 1.1 contains six observations, one for
each of the six retail companies.

The information given in Table 1.1 on 2007 charitable givings of companies is called the
data or a data set.

Definition

Data Set A data set is a collection of observations on one or more variables.

Other examples of data sets are a list of the prices of 25 recently sold homes, test scores
of 15 students, opinions of 100 voters, and ages of all employees of a company.
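
The four basic terms can be made concrete with a small Python sketch. It uses the test-score setting mentioned in Section 1.2.1. The student names and scores are invented for illustration and do not come from the text.

```python
# Hypothetical data set: test scores of five students (names and scores are invented).
variable = "test score"
data_set = {
    "Ana": 88,
    "Ben": 92,
    "Chen": 75,
    "Dara": 92,
    "Eli": 81,
}

elements = list(data_set.keys())        # the students
observations = list(data_set.values())  # one value of the variable per element

print(f"variable: {variable}")
print(f"number of elements: {len(elements)}")
print(f"number of observations: {len(observations)}")
print(f"observation for Chen: {data_set['Chen']}")
```

Two of the students have the same score, which matches the remark above that the values of a variable may coincide for some elements.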

EXERCISES

CONCEPTS AND PROCEDURES

1.8 Explain the meaning of an element, a variable, an observation, and a data set.

■ APPLICATIONS

1.9 The following table gives the number of dog bites reported to the police last year in six cities.

City           Number of Bites
Center City
Elm Grove
Franklin
Bay City
Oakdale
Sand Point

Briefly explain the meaning of a member, a variable, a measurement, and a data set with reference to this table.

1.10 The following table gives the state taxes (in dollars) on a pack of cigarettes for nine states as of
April 1, 2009.

State             State Tax
                  (in dollars)
Alaska            2.00
Iowa              1.36
Massachusetts     2.51
Missouri          .17
New Hampshire     1.33
New York          2.75
Ohio              1.25
South Carolina    .07
West Virginia     .55

Briefly explain the meaning of a member, a variable, a measurement, and a data set with reference to this table.

1.11 Refer to the data set in Exercise 1.9.

a. What is the variable for this data set?

b. How many observations are in this data set?

c. How many elements does this data set contain?

1.12 Refer to the data set in Exercise 1.10.

a. What is the variable for this data set?

b. How many observations are in this data set?

c. How many elements does this data set contain?

1.5 Types of Variables

In Section 1.4, we learned that a variable is a characteristic under investigation that assumes
different values for different elements. The incomes of families, heights of persons, gross sales
of companies, prices of college textbooks, makes of cars owned by families, number of acci-
dents, and status (freshman, sophomore, junior, or senior) of students enrolled at a university
are examples of variables.

A variable may be classified as quantitative or qualitative. These two types of variables are
explained next.

1.5.1 Quantitative Variables

Some variables (such as the price of a home) can be measured numerically, whereas others (such
as hair color) cannot. The first is an example of a quantitative variable and the second that of
a qualitative variable.

Definition

Quantitative Variable A variable that can be measured numerically is called a quantitative vari-
able. The data collected on a quantitative variable are called quantitative data.

Incomes, heights, gross sales, prices of homes, number of cars owned, and number of ac-
cidents are examples of quantitative variables because each of them can be expressed numerically.
For instance, the income of a family may be $81,520.75 per year, the gross sales for a company
may be $567 million for the past year, and so forth. Such quantitative variables may be classi-
fied as either discrete variables or continuous variables.

Discrete Variables

The values that a certain quantitative variable can assume may be countable or noncountable.
For example, we can count the number of cars owned by a family, but we cannot count the
height of a family member. A variable that assumes countable values is called a discrete vari-
able. Note that there are no possible intermediate values between consecutive values of a dis-
crete variable.

Definition

Discrete Variable A variable whose values are countable is called a discrete variable. In
other words, a discrete variable can assume only certain values with no intermediate values.

For example, the number of cars sold on any day at a car dealership is a discrete variable
because the number of cars sold must be 0, 1, 2, 3, . . . and we can count it. The number of cars
sold cannot be between 0 and 1, or between 1 and 2. Other examples of discrete variables are
the number of people visiting a bank on any day, the number of cars in a parking lot, the number
of cattle owned by a farmer, and the number of students in a class.

Continuous Variables

Some variables cannot be counted, and they can assume any numerical value between two num-
bers. Such variables are called continuous variables.

Definition

Continuous Variable A variable that can assume any numerical value over a certain interval or
intervals is called a continuous variable.

The time taken to complete an examination is an example of a continuous variable because
it can assume any value, let us say, between 30 and 60 minutes. The time taken may be 42.6
minutes, 42.67 minutes, or 42.674 minutes. (Theoretically, we can measure time as precisely as
we want.) Similarly, the height of a person can be measured to the tenth of an inch or to the
hundredth of an inch. However, neither time nor height can be counted in a discrete fashion.
Other examples of continuous variables are weights of people, amount of soda in a 12-ounce can
(note that a can does not contain exactly 12 ounces of soda), and yield of potatoes (in pounds)
per acre. Note that any variable that involves money is considered a continuous variable.

1.5.2 Qualitative or Categorical Variables

Variables that cannot be measured numerically but can be divided into different categories are
called qualitative or categorical variables.

Definition

Qualitative or Categorical Variable A variable that cannot assume a numerical value but can be
classified into two or more nonnumeric categories is called a qualitative or categorical variable.
The data collected on such a variable are called qualitative data.

For example, the status of an undergraduate college student is a qualitative variable because
a student can fall into any one of four categories: freshman, sophomore, junior, or senior. Other
examples of qualitative variables are the gender of a person, the brand of a computer, the opin-
ions of people, and the make of a car.

Figure 1.2 illustrates the types of variables.

Variable
  Quantitative
    Discrete (e.g., number of houses, cars, accidents)
    Continuous (e.g., length, age, height, weight, time)
  Qualitative or categorical (e.g., make of a computer, opinions of people, gender)

Figure 1.2 Types of variables.
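
The classification in Figure 1.2 can be approximated in Python by looking at how the values of a variable are recorded. This is only a rule of thumb and not a definition from the text: the function name classify and the sample values are hypothetical, and numeric codes such as ZIP codes would be misclassified by it.

```python
def classify(values):
    """Classify a variable from its recorded values (a rough rule of thumb)."""
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(isinstance(v, int) for v in values):
        return "quantitative, discrete"
    return "quantitative, continuous"

# Hypothetical observations on three variables.
cars_owned = [0, 2, 1, 3]                          # counts
exam_minutes = [42.6, 55.0, 38.25]                 # measurements
class_status = ["freshman", "senior", "junior"]    # categories

print(classify(cars_owned))
print(classify(exam_minutes))
print(classify(class_status))
```

The sketch mirrors the two questions asked in this section: can the variable be measured numerically, and if so, are its values countable?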

EXERCISES

CONCEPTS AND PROCEDURES

1.13 Explain the meaning of the following terms.

a. Quantitative variable

b. Qualitative variable

c. Discrete variable

d. Continuous variable

e. Quantitative data

f. Qualitative data

■ APPLICATIONS

1.14 Indicate which of the following variables are quantitative and which are qualitative.

a. Number of persons in a family

b. Colors of cars

c. Marital status of people

d. Time to commute from home to work

e. Number of errors in a person's credit report

1.15 Indicate which of the following variables are quantitative and which are qualitative.

a. Number of typographical errors in newspapers

b. Monthly TV cable bills

c. Spring break locations favored by college students

d. Number of cars owned by families

e. Lottery revenues of states

1.16 Classify the quantitative variables in Exercise 1.14 as discrete or continuous.

1.17 Classify the quantitative variables in Exercise 1.15 as discrete or continuous.

1.6 Cross-Section Versus Time-Series Data

Based on the time over which they are collected, data can be classified as either cross-section
or time-series data.

1.6.1 Cross-Section Data

Cross-section data contain information on different elements of a population or sample for
the same period of time. The information on incomes of 100 families for 2009 is an example
of cross-section data. All examples of data already presented in this chapter have been cross-
section data.

Definition

Cross-Section Data Data collected on different elements at the same point in time or for the
same period of time are called cross-section data.

Table 1.1 is reproduced here as Table 1.2, which shows the 2007 charitable givings of six retail
companies. Because this table presents data on the charitable givings of six companies for
the same period (2007), it is an example of cross-section data.

Table 1.2 Charitable Givings of Six Retailers in 2007

Company        2007 Charitable Givings
               (millions of dollars)
Home Depot
Macy's         35.2
Wal-Mart       337.9
Best Buy       31.8
Target         168.9
Lowe's         27.5

Source: The Chronicle of Philanthropy.

1.6.2 Time-Series Data

Time-series data contain information on the same element for different periods of time. Infor-
mation on U.S. exports for the years 1983 to 2009 is an example of time-series data.

Definition

Time-Series Data Data collected on the same element for the same variable at different points
in time or for different periods of time are called time-series data.

The data given in Table 1.3 are an example of time-series data. This table lists the total
number of indoor movie screens in the United States for the years 2003 to 2008. Note that each
screen in each theater counts as one. For example, a movieplex with 8 screens would count as
8 toward the total number of screens.

Table 1.3 Number of Movie Screens

Year     Total Indoor Movie Screens
2003     35,361
2004     36,012
2005     37,092
2006     37,776
2007     38,159
2008     38,198

Source: National Association for Theater Owners.
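
Because time-series observations are ordered in time, one natural use of them is to compare consecutive periods. The Python sketch below stores the values from Table 1.3 and computes the year-to-year change. The variable names are chosen only for illustration.

```python
# Total indoor movie screens in the United States, from Table 1.3.
screens = {
    2003: 35_361,
    2004: 36_012,
    2005: 37_092,
    2006: 37_776,
    2007: 38_159,
    2008: 38_198,
}

# Because the observations are ordered in time, consecutive years can be compared.
years = sorted(screens)
changes = {
    year: screens[year] - screens[prev]
    for prev, year in zip(years, years[1:])
}

for year, change in changes.items():
    print(f"{year}: {change:+d} screens")
```

A comparison of this kind would not be meaningful for cross-section data such as Table 1.2, where the rows are different elements observed over the same period.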

1.7 Sources of Data

The availability of accurate and appropriate data is essential for deriving reliable results. 3 Data
may be obtained from internal sources, external sources, or surveys and experiments.

Many times data come from internal sources, such as a company's personnel files or ac-
counting records. For example, a company that wants to forecast the future sales of its prod-
uct may use the data of past periods from its records. For most studies, however, all the data
that are needed are not usually available from internal sources. In such cases, one may have
to depend on outside sources to obtain data. These sources are called external sources. For in-
stance, the Statistical Abstract of the United States (published annually), which contains var-
ious kinds of data on the United States, is an external source of data.

A large number of government and private publications can be used as external sources of
data. The following is a list of some government publications.

1. Statistical Abstract of the United States

2. Employment and Earnings

3. Handbook of Labor Statistics

4. Source Book of Criminal Justice Statistics

5. Economic Report of the President

6. County & City Data Book

7. State & Metropolitan Area Data Book

8. Digest of Education Statistics

9. Health United States

10. Agricultural Statistics

Most of the data contained in these books can be accessed on Internet sites such as
www.census.gov (Census Bureau), www.bls.gov (Bureau of Labor Statistics), www.ojp.usdoj.gov/bjs
(Office of Justice Programs, U.S. Department of Justice, Bureau of Justice Statistics),
www.os.dhhs.gov (U.S. Department of Health and Human Services), and
www.usda.gov/nass/pubs/agstats.htm (U.S. Department of Agriculture, Agricultural Statistics).

Besides these government publications, a large number of private publications (e.g.,
Standard & Poor's Security Owner's Stock Guide and World Almanac and Book of Facts) and
periodicals (e.g., The Wall Street Journal, USA TODAY, Fortune, Forbes, and Business Week)
can be used as external data sources.

3 Sources of data are discussed in more detail in Appendix A.

Sometimes the needed data may not be available from either internal or external sources.
In such cases, the investigator may have to conduct a survey or experiment to obtain the
required data. Appendix A discusses surveys and experiments in detail.

EXERCISES

CONCEPTS AND PROCEDURES

1.18 Explain the difference between cross-section and time-series data. Give an example of each of these
two types of data.

1.19 Briefly describe internal and external sources of data.

■ APPLICATIONS

1.20 Classify the following as cross-section or time-series data.

a. Food bill of a family for each month of 2009

b. Number of armed robberies each year in Dallas from 1998 to 2009

c. Number of supermarkets in 40 cities on December 31, 2009

d. Gross sales of 200 ice cream parlors in July 2009

1.21 Classify the following as cross-section or time-series data.

a. Average prices of houses in 100 cities

b. Salaries of 50 employees

c. Number of cars sold each year by General Motors from 1980 to 2009

d. Number of employees employed by a company each year from 1985 to 2009

1.8 Summation Notation

Sometimes mathematical notation helps express a mathematical relationship concisely. This sec-
tion describes the summation notation that is used to denote the sum of values.

Suppose a sample consists of five literary books, and the prices of these five books are $75,
$80, $35, $97, and $88, respectively. The variable price of a book can be denoted by x. The
prices of the five books can be written as follows:

Price of the first book = \(x_1\) = $75

(The subscript of x denotes the number of the book.)

Similarly,

Price of the second book = \(x_2\) = $80
Price of the third book = \(x_3\) = $35
Price of the fourth book = \(x_4\) = $97
Price of the fifth book = \(x_5\) = $88
In this notation, x represents the price, and the subscript denotes a particular book.
Now, suppose we want to add the prices of all five books. We obtain

\[ x_1 + x_2 + x_3 + x_4 + x_5 = 75 + 80 + 35 + 97 + 88 = \$375 \]

The uppercase Greek letter \(\Sigma\) (pronounced sigma) is used to denote the sum of all values.
Using \(\Sigma\) notation, we can write the foregoing sum as follows:

\[ \sum x = x_1 + x_2 + x_3 + x_4 + x_5 = \$375 \]

The notation \(\sum x\) in this expression represents the sum of all the values of x and is read as "sigma
x" or "sum of all values of x."

Using summation
notation: one variable.

■ EXAMPLE 1-1

Annual salaries (in thousands of dollars) of four workers are 75, 90, 125, and 61, respectively. Find
(a) Σx (b) (Σx)² (c) Σx²

Solution Let x₁, x₂, x₃, and x₄ be the annual salaries (in thousands of dollars) of the first,
second, third, and fourth worker, respectively. Then,

x₁ = 75, x₂ = 90, x₃ = 125, and x₄ = 61

(a) Σx = x₁ + x₂ + x₃ + x₄ = 75 + 90 + 125 + 61 = 351 = $351,000

(b) Note that (Σx)² is the square of the sum of all x values. Thus,

(Σx)² = (351)² = 123,201

(c) The expression Σx² is the sum of the squares of x values. To calculate Σx², we first
square each of the x values and then sum these squared values. Thus,

Σx² = (75)² + (90)² + (125)² + (61)²
    = 5625 + 8100 + 15,625 + 3721 = 33,071 ■
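
The three quantities in Example 1-1 are easy to confuse, so it may help to see them computed side by side. The following is a minimal Python sketch, not part of the original text; the variable names are chosen here for illustration only.

```python
# Annual salaries (in thousands of dollars) from Example 1-1.
salaries = [75, 90, 125, 61]

# Sigma x: add the values.
sum_x = sum(salaries)

# (Sigma x) squared: add the values first, then square the total.
square_of_sum = sum_x ** 2

# Sigma x squared: square each value first, then add the squares.
sum_of_squares = sum(x ** 2 for x in salaries)

print("sum of x:", sum_x)
print("square of the sum:", square_of_sum)
print("sum of the squares:", sum_of_squares)
```

The order of the two operations is the whole difference: squaring after summing gives (Σx)², while squaring before summing gives Σx².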

Using summation
notation: two variables.

■ EXAMPLE 1-2

The following table lists four pairs of m and f values:

m    12    15    20    30
f     5     9    10    16

Compute the following:

(a) Σm (b) Σf² (c) Σmf (d) Σm²f

Solution We can write

m₁ = 12   m₂ = 15   m₃ = 20   m₄ = 30
f₁ = 5    f₂ = 9    f₃ = 10   f₄ = 16

(a) Σm = 12 + 15 + 20 + 30 = 77

(b) Σf² = (5)² + (9)² + (10)² + (16)² = 25 + 81 + 100 + 256 = 462

(c) Σmf = m₁f₁ + m₂f₂ + m₃f₃ + m₄f₄
        = 12(5) + 15(9) + 20(10) + 30(16) = 875

(d) To calculate Σm²f, we square each m value, then multiply the corresponding m² and
f values, and add the products. Thus,

Σm²f = (m₁)²f₁ + (m₂)²f₂ + (m₃)²f₃ + (m₄)²f₄
     = (12)²(5) + (15)²(9) + (20)²(10) + (30)²(16) = 21,145

The calculations done in parts (a) through (d) to find the values of Σm, Σf², Σmf, and Σm²f
can be performed in tabular form, as shown in Table 1.4.

Table 1.4

m         f         f²                mf                m²f
12        5         5 × 5 = 25        12 × 5 = 60       12 × 12 × 5 = 720
15        9         9 × 9 = 81        15 × 9 = 135      15 × 15 × 9 = 2025
20        10        10 × 10 = 100     20 × 10 = 200     20 × 20 × 10 = 4000
30        16        16 × 16 = 256     30 × 16 = 480     30 × 30 × 16 = 14,400
Σm = 77   Σf = 40   Σf² = 462         Σmf = 875         Σm²f = 21,145

The columns of Table 1.4 can be explained as follows.

1. The first column lists the values of m. The sum of these values gives Σm = 77.

2. The second column lists the values of f. The sum of this column gives Σf = 40.

3. The third column lists the squares of the f values. For example, the first value, 25, is
the square of 5. The sum of the values in this column gives Σf² = 462.

4. The fourth column records products of the corresponding m and f values. For example,
the first value, 60, in this column is obtained by multiplying 12 by 5. The sum of
the values in this column gives Σmf = 875.

5. Next, the m values are squared and multiplied by the corresponding f values. The
resulting products, denoted by m²f, are recorded in the fifth column. For example, the
first value, 720, is obtained by squaring 12 and multiplying this result by 5. The sum
of the values in this column gives Σm²f = 21,145.
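
The column-by-column work of Table 1.4 can also be reproduced in a few lines of code. The sketch below is an illustration in Python (not part of the original text) using the m and f values from Example 1-2; the variable names are assumptions made for this example.

```python
# The four (m, f) pairs from Example 1-2.
m = [12, 15, 20, 30]
f = [5, 9, 10, 16]

# Column totals, in the same order as the columns of Table 1.4.
sum_m = sum(m)                                           # Sigma m
sum_f = sum(f)                                           # Sigma f
sum_f_sq = sum(fi ** 2 for fi in f)                      # Sigma f squared
sum_mf = sum(mi * fi for mi, fi in zip(m, f))            # Sigma mf
sum_m_sq_f = sum(mi ** 2 * fi for mi, fi in zip(m, f))   # Sigma m squared f

# Print one row per pair, then the totals row.
for mi, fi in zip(m, f):
    print(mi, fi, fi ** 2, mi * fi, mi ** 2 * fi)
print(sum_m, sum_f, sum_f_sq, sum_mf, sum_m_sq_f)
```

Pairing the lists with zip mirrors reading across one row of the table: each m value is always combined with the f value in the same position.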

EXERCISES

CONCEPTS AND PROCEDURES

1.22 The following table lists five pairs of m and f values.

Compute the value of each of the following:

a. Σm b. Σf² c. Σmf d. Σm²f

1.23 The following table lists six pairs of m and f values.

Calculate the value of each of the following:

a. Σf

b. Xr,

c. Σmf d. Σm²f

1.24 The following table contains information on the NCAA Men's Basketball Championship tournament
Final Four teams for the 31-year period from 1979 to 2009. The table shows how many teams with each
seeding qualified for the Final Four during these 31 years. For example, 53 of the 124 Final Four teams
during these 31 years were seeded number one, 27 of the 124 Final Four teams were seeded number two,
and so on. Note that none of the teams seeded number 10 qualified for the Final Four in these 31 years.

Seed

Number of Teams in Men's Final Four

Let y denote the seed and x denote the number of teams having that seed. Calculate the following:
a. Σx b. Σy c. Σxy d. Σy² e. (Σy)²


1.25 The following table contains the same kind of information as the table in Exercise 1.24 but for the
NCAA Women's Basketball Championship tournament Final Four teams for 28 years from 1982 to 2009.

Seed

1 2

Number of Teams in Women's Final Four

59 27

Let y denote the seed and x denote the number of teams having that seed. Calculate the following:
a. Σx b. Σy c. Σxy d. Σy² e. (Σy)²

■ APPLICATIONS

1.26 Eight randomly selected customers at a local grocery store spent the following amounts on groceries
in a single visit: $216, $184, $35, $92, $144, $175, $11, and $57, respectively. Let y denote the amount
spent on groceries in a single visit. Find:

a. Σy b. (Σy)² c. Σy²

1.27 The number of pizzas delivered to a college campus on six randomly selected nights is 48, 103, 95,
188, 286, and 136, respectively. Let x denote the number of pizzas delivered to this college campus on any
given night. Find:

a. Σx b. (Σx)² c. Σx²

1.28 Prices (in thousands of dollars) of five new cars are 28, 35, 39, 54, and 18, respectively. Let x be the
price of a new car in this sample. Find:

a. Σx b. (Σx)² c. Σx²

1.29 The number of students (rounded to the nearest thousand) currently enrolled at seven universities is
7, 39, 21, 16, 3, 43, and 19, respectively. Let x be the number of students currently enrolled at a univer-
sity. Find:

a. Σx b. (Σx)² c. Σx²

USES AND MISUSES... speaking the language of statistics

Have you ever heard the statistic "the average American family has
2.1 children?" What is wrong with this statement, and how do we fix
it? How about: "In a representative sample of 10 American families,
one can expect there to be 21 children." The statement is wordy but
more accurate. Why do we care?

Statisticians pay close attention to definitions because, without
them, calculations would be impossible to make and interpretations
of the data would be meaningless. When statistics are reported in the newspaper,
the journalist or editor often chooses to describe the results in a way that is
easier to understand but that distorts the actual statistical result.

Let us pick apart our example. The word average has a very spe-
cific meaning in probability (Chapters 4 and 5). The intended meaning
of the word here really is typical. The adjective American helps us de-
fine the population. The Census Bureau defines family as "a group of
two people or more (one of whom is the householder) related by birth,
marriage, or adoption and residing together; all such people (including
related subfamily members) are considered as members of one family."
It defines children as "all persons under 18 years, excluding people who
maintain households, families, or subfamilies as a reference person or
spouse." We understand implicitly that a family cannot have a fractional
number of children, so we accept that this discrete variable takes on
the properties of a continuous variable when we are talking about the
characteristics of a large population. How large does the population
need to be before we can derive continuous variables from discrete
variables? The answer comes in the chapters that follow.
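
To see how a discrete variable can produce a fractional average, consider the short Python sketch below. The ten family sizes are hypothetical values invented for this illustration (they are not data from the text); they were chosen only so that the ten families have 21 children in total.

```python
from statistics import mean

# Hypothetical numbers of children in 10 families (illustrative values only).
children = [2, 3, 1, 0, 4, 2, 3, 2, 1, 3]

total_children = sum(children)      # a whole number, since each count is whole
average_children = mean(children)   # the total divided by the number of families

print("total children in 10 families:", total_children)
print("average children per family:", average_children)
```

Every individual value is a whole number, yet the average is not. No single family has 2.1 children; the figure describes the group as a whole.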

The moral of the story is that whenever you read a statistical re-
sult, be sure that you understand the definitions of the terms used
to describe the result and relate those terms to the definitions that
you already know. In some cases year is a categorical variable, in oth-
ers it is a discrete variable, and in others it is a continuous variable.
Many surveys will report that "respondents feel better, the same, or
worse" about a particular subject. Although better, same, and worse
have a natural order to them, they do not have numerical values.

Glossary

Census A survey that includes all members of the population.

Continuous variable A (quantitative) variable that can assume any
numerical value over a certain interval or intervals.

Cross-section data Data collected on different elements at the
same point in time or for the same period of time.


Data or data set Collection of observations or measurements on
a variable.

Descriptive statistics Collection of methods for organizing, display-
ing, and describing data using tables, graphs, and summary measures.

Discrete variable A (quantitative) variable whose values are
countable.

Element or member A specific subject or object included in a
sample or population.

Inferential statistics Collection of methods that help make deci-
sions about a population based on sample results.

Observation or measurement The value of a variable for an element.

Population or target population The collection of all elements
whose characteristics are being studied.

Qualitative or categorical data Data generated by a qualitative
variable.

Qualitative or categorical variable A variable that cannot assume
numerical values but is classified into two or more categories.

Quantitative data Data generated by a quantitative variable.

Quantitative variable A variable that can be measured numerically.

Supplementary Exercises

1.30 The following table gives the total number of DVDs sold at retail stores between 2003 and 2008.

Year    U.S. Retail Sales of DVDs (millions of DVDs)
2003    11.3
2004    15.1
2005    16.0
2006    16.3
2007    15.8
2008    15.2

Source: SNL Kagan.

Describe the meanings of a variable, a measurement, and a data set with reference to this table.

1.31 The following table gives the total 2009 payrolls (on the opening day of the 2009 season, rounded to
the nearest million dollars) for eight National League baseball teams.

Team                      Total Payroll (millions of dollars)
Atlanta Braves
Chicago Cubs              135
Florida Marlins
Los Angeles Dodgers       100
New York Mets             149
Philadelphia Phillies     113
Pittsburgh Pirates
San Francisco Giants

Describe the meanings of a member, a variable, a measurement, and a data set with reference to this table.

1.32 Refer to Exercises 1.30 and 1.31. Classify these data sets as either cross-section or time-series data.

Random sample A sample drawn in such a way that each element
of the population has some chance of being included in the sample.

Representative sample A sample that contains the same charac-
teristics as the corresponding population.

Sample A portion of the population of interest.

Sample survey A survey that includes elements of a sample.

Simple random sampling If all samples of the same size selected
from a population have the same chance of being selected, it is called
simple random sampling. Such a sample is called a simple random
sample.
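
The difference between sampling without replacement and sampling with replacement can be sketched with Python's standard random module. This is an illustration only: the population of 100 numbered elements and the seed value are assumptions made for the example, not anything specified in the text.

```python
import random

# Hypothetical population: 100 elements numbered 1 through 100.
population = list(range(1, 101))

# A seeded generator so that repeated runs give the same selections.
rng = random.Random(2010)

# Without replacement: an element can be selected at most once, and every
# possible sample of size 10 has the same chance of being selected.
without_replacement = rng.sample(population, k=10)

# With replacement: each selection is made from the full population,
# so the same element may appear more than once.
with_replacement = rng.choices(population, k=10)

print("without replacement:", sorted(without_replacement))
print("with replacement:   ", sorted(with_replacement))
```

Because random.sample never returns the same position twice, it corresponds to the simple random sampling described above; random.choices may repeat elements, which is the defining feature of sampling with replacement.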

Statistics Group of methods used to collect, analyze, present, and
interpret data and to make decisions.

Survey Collection of data on the elements of a population or
sample.

Time-series data Data that give the values of the same variable
for the same element at different points in time or for different pe-
riods of time.

Variable A characteristic under study or investigation that assumes
different values for different elements.


1.33 Indicate whether each of the following examples refers to a population or to a sample.

a. A group of 25 patients selected to test a new drug

b. Total items produced on a machine for each year from 1995 to 2009

c. Yearly expenditures on clothes for 50 persons

d. Number of houses sold by each of the 10 employees of a real estate agency during 2009

1.34 Indicate whether each of the following examples refers to a population or to a sample.

a. Salaries of CEOs of all companies in New York City

b. Five hundred houses selected from a city

c. Gross sales for 2009 of four fast-food chains

d. Annual incomes of all 33 employees of a restaurant

1.35 State which of the following is an example of sampling with replacement and which is an example
of sampling without replacement.

a. Selecting 10 patients out of 100 to test a new drug

b. Selecting one professor to be a member of the university senate and then selecting one professor
from the same group to be a member of the curriculum committee

1.36 State which of the following is an example of sampling with replacement and which is an example
of sampling without replacement.

a. Selecting seven cities to market a new deodorant

b. Selecting a high school teacher to drive students to a lecture in March, then selecting a teacher
from the same group to chaperone a dance in April

1.37 The number of shoe pairs owned by six women is 8, 14, 3, 7, 10, and 5, respectively. Let x denote
the number of shoe pairs owned by a woman. Find:

a. Σx b. (Σx)² c. Σx²

1.38 The number of restaurants in each of five small towns is 4, 12, 8, 10, and 5, respectively. Let y de-
note the number of restaurants in a small town. Find:

a. Σy b. (Σy)² c. Σy²

1.39 The following table lists five pairs of m and f values.

Compute the value of each of the following:

a. Σm b. Σf² c. Σmf d. Σm²f e. Σm²

1.40 The following table lists six pairs of x and y values.

Compute the value of each of the following:

a. Σy b. Σx² c. Σxy d. Σx²y e. Σy²

Self-Review Test

1. A population in statistics means a collection of all

a. men and women

b. subjects or objects of interest

c. people living in a country

2. A sample in statistics means a portion of the

a. people selected from the population of a country

b. people selected from the population of an area

c. population of interest

3. Indicate which of the following is an example of a sample with replacement and which is a sample
without replacement.

a. Five friends go to a livery stable and select five horses to ride (each friend must choose a differ-
ent horse).

b. A box contains five marbles of different colors. A marble is drawn from this box, its color is recorded,
and it is put back into the box before the next marble is drawn. This experiment is repeated 12 times.


4. Indicate which of the following variables are quantitative and which are qualitative. Classify the quan-
titative variables as discrete or continuous.

a. Women's favorite TV programs

b. Salaries of football players

c. Number of pets owned by families

d. Favorite breed of dog for each of 20 persons

5. The following table contains data on the 10 biggest Nasdaq losers of October 2008. The first column
in the table contains the names of the companies and their NASDAQ symbols, and the second column
gives the returns for the stocks of these companies for the month of October 2008.

Company (NASDAQ Symbol)                    October 2008 Return
Smurfit-Stone Container (SSCC)             -71
Bruker Corporation (BRKR)                  -69
Savient Pharmaceuticals (SVNT)             -68
Parexel International Corp (PRXL)          -63
Global Industries (GLBL)                   -63
Rigel Pharmaceuticals (RIGL)               -62
Liberty Media Interactive (LINTA)          -62
YRC Worldwide (YRCW)                       -61
Grupo Financiero Galicia S.A. (GGAL)       -61
Bare Escentuals (BARE)                     -61

Source: The Motley Fool.

Explain the meaning of a member, a variable, a measurement, and a data set with reference to this table.

6. The number of credit cards possessed by five couples is 2, 5, 3, 12, and 7, respectively. Let x be the
number of credit cards possessed by a couple. Find:

a. Σx b. (Σx)² c. Σx²

7. The following table lists five pairs of m and f values.

Calculate

a. Σm b. Σf c. Σm² d. Σmf e. Σm²f f. Σf²

Mini-Project

■ MINI-PROJECT 1-1

In this mini-project, you are going to obtain a data set of interest to you that you will use for mini-projects
in some of the other chapters. The data set should contain at least one qualitative variable and one quan-
titative variable, although having two of each will be necessary in some cases. Ask your instructor how
many variables you should have. A good-size data set to work with should contain somewhere between
50 and 100 observations.

Here are some examples of the procedures to use to obtain data:

1. Take a random sample of used cars and collect data on them. You may use Web sites like Cars.com,
AutoTrader.com, and so forth. Quantitative variables may include the price, mileage, and age of a car.
Categorical variables may include the model, drive train (front wheel, rear wheel, and so forth), and
type (compact, SUV, minivan, and so forth). You can concentrate on your favorite type of car, or look
at a variety of types.

2. Examine the real estate ads in your local newspaper or online and obtain information on houses
for sale that may include listed price, number of bedrooms, lot size, living space, town, type of house,
number of garage spaces, and number of bathrooms.

3. Use an almanac or go to a government Web site, such as www.census.gov or www.cdc.gov, to obtain
information for each state. Quantitative variables may include income, birth and death rates, cancer
incidence, and the proportion of people living below the poverty level. Categorical variables may
include things like the region of the country where each state is located and which party won the state
governorship in the last election. You can also collect this information on a worldwide level and use the
continent or world region as a categorical variable.

4. Take a random sample of students and ask them questions such as:

• How much money did you spend on books last semester?

• How many credit hours did you take?

• What is your major?

5. If you are a sports fan, you can use an almanac or sports Web site to obtain statistics on a random
sample of athletes. You can look at sport-specific statistics such as home runs, runs batted in, position,
left-handed/right-handed, and so forth in baseball, or you could collect information to compare differ-
ent sports by gathering information on salary, career length, weight, and so forth.

Once you have collected the information, write a brief report that includes answers to the following
tasks/questions:

a. Describe the variables on which you have collected information.

b. Describe a reasonable target population for the sample you used.

c. Is your sample a random sample from this target population?

d. Do you feel that your sample is representative of this population?

e. Is this an example of sampling with or without replacement?

f. For each quantitative variable, state whether it is continuous or discrete.

g. Describe the meaning of an element, a variable, and a measurement for this data set.

h. Describe any problems you faced in collecting these data.

i. Were any of the data values unusable? If yes, explain why.

Your instructor will probably want to see a copy of the data you collected. If you are using statisti-
cal software in the class, enter the data into that software and submit a copy of the data file. If you are
using a handheld technology calculator, such as a graphing calculator, you will probably have to print out
a hard copy version of the data set. Save this data set for projects in future chapters.

TECHNOLOGY INSTRUCTION

Entering and Saving Data

Screen 1.1 TI-84 list editor with the cursor at L1(1).

Screen 1.2 TI-84 list editor with the entry L2(1)=50.

Technology makes the process of data analysis much easier and faster. Therefore, you need to be able to enter
the data, proofread them, and revise them. Moreover, you can save the data and retrieve them for use at a
later date.

Entering Data in a List

1. On the TI-84, variables are referred to as lists.

2. In order to enter data into the TI-84, you first need to decide whether you want to save the
data for later use or just use it in the immediate future.

3. If you will be just using it in the immediate future, select STAT >EDIT >SetUpEditor,
and then press ENTER. This will set up the editor to use "scratch" lists L1, L2, L3, L4,
L5, and L6 (Screen 1.1). Now select STAT >EDIT >Edit and start typing your numeric
data into the column or columns, pressing ENTER after each entry (Screen 1.2). Note
that the TI-84 calculator will not handle nonnumeric data.

Changing List Names/Establishing Visible Lists

1. The TI-84 has only six "scratch" lists. In some cases you will be using your data at a later
date. You can rename a list so that you do not have to reenter the data. Select STAT
>EDIT >SetUpEditor, and then type in the names of your variables separated by commas
(Screens 1.3 and 1.4). Names can be one to five letters long, with the letters found in
green on your keypad. You can use the green ALPHA key with each letter, or press
A-LOCK (2nd > ALPHA) while you are typing the name. To turn off A-LOCK, press
ALPHA.

2. You can use the arrow keys to move around and go back to a cell to edit its contents.
When editing values, you will need to press ENTER for the changes to take effect.

3. SetUpEditor determines what lists are displayed in the editor. Changing what SetUpEditor
displays does not delete any lists. Your lists remain in storage when the calculator is
turned off.

Screen 1.3 SetUpEditor EX1, EX2 entered on the home screen.

Screen 1.4 List editor showing the lists EX1 and EX2, with the entry EX2(1)=50.

Numeric Operations on Lists

1. To calculate the sum of the values in a list, such as L1, select LIST (2nd > STAT) >
MATH > sum(. Enter the name of the list (e.g., 2nd > 1 for L1), then type the right
parenthesis. Press ENTER. (See Screens 1.5 and 1.6.)

2. If you need to find the sum of values and the square of the sum denoted by (Σx)², you
can use the same instructions as in item 1. However, just before you press ENTER,
press the x² button. If you wish to square each value and calculate the sum of the
squared values, which is denoted by Σx², press the x² button after entering the name of
the list but prior to typing the right parenthesis. Screen 1.6 shows the appearance of
these two processes.

Screen 1.5 LIST MATH menu: min(, max(, mean(, median(, sum(, prod(, stdDev(.

Screen 1.6 sum(L1) gives 285, sum(L1)² gives 81225, and sum(L1²) gives 15655.

Entering and Saving Data

1. Start Minitab. You will see the computer screen divided into three parts: a Session
window, which will contain numeric output; a Worksheet, which looks similar to a
spreadsheet, where you will enter your data (see Screen 1.7); and a Project Manager
window. You are allowed to have multiple worksheets within a project.

C1      C2      C3-T
year    sales   employee
2008    35      J. Smith
2008    38      A. Jones
2009    50      J. Smith
2009    48      A. Jones

Screen 1.7

2. Use the mouse or the arrow keys to select where you want to start entering your data.
Each column in the worksheet corresponds to a variable, so you can enter only one kind
of data into a given column. Data can be numeric, text, or date/time. The rectangles in
the worksheet are called cells, and the cells are organized into columns such as C1, C2,
and so on, each with rows 1, 2, and so forth. Note that if a column contains text data,
"-T" will be added to the column heading.

3. The blank row between the column labels and the first row is for variable names. In
these blank cells, you can type the names of variables.

4. You can change whether you are typing the data across in rows or down in columns
by clicking the direction arrow at the top left of the worksheet (also shown in
Screen 1.7).


5. Click on a cell and begin typing. Press Enter when you are finished with that cell.

6. If you need to revise an entry, go to that cell with the mouse or the arrow keys and begin
typing. Press Enter to put the revised entry into the cell.

7. When you are done, select File >Save Project As to save your work for the first time as
a file on your computer. Note that Minitab will automatically assign the file extension
.mpj to your work after you choose the filename.

8. Try entering the following data into Minitab:

January     52    .08
February    48    .06
March       49    .07

Name the columns Month, Sales, Increase. Save the result as the file test.mpj.

9. To retrieve the file, select File >Open and select the file test.mpj.

10. If you are already in Minitab and you want to start a new worksheet, select File >New
and choose Worksheet. Whenever you save a project, Minitab will automatically save all
of the worksheets in the project.
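
The same enter, save, and retrieve cycle can be sketched in a general-purpose language. The Python example below is an illustration only and is not a Minitab feature: it writes the Month, Sales, and Increase columns to a CSV file and reads them back. The file name test.csv, the temporary folder, and the Sales values of 52, 48, and 49 (taken from the Excel screens later in this section) are assumptions made for the example.

```python
import csv
import os
import tempfile

# One tuple per row: month, sales, increase.
rows = [("January", 52, 0.08), ("February", 48, 0.06), ("March", 49, 0.07)]

# Save the data, with the column names in the first row of the file.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Month", "Sales", "Increase"])
    writer.writerows(rows)

# Retrieve the data. CSV stores everything as text, so the numeric
# columns are converted back to numbers while reading.
with open(path, newline="") as fh:
    reader = csv.DictReader(fh)
    loaded = [(r["Month"], int(r["Sales"]), float(r["Increase"])) for r in reader]

print(loaded)
```

As in a worksheet, each column holds one kind of data: Month is text, while Sales and Increase are numeric.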

Screen 1.8 Minitab Calculator dialog box, with C4 in the Store result in variable: box
and 'sales'**2 in the Expression: box.

Creating New Columns from Existing Columns

In some circumstances, such as when you need to calculate Σx² or Σxy, you will need to
calculate a new column of values using one or more existing columns. To calculate a column
containing the squares of the values in the column Sales as shown in Screen 1.7:

1. Select Calc > Calculator.

2. Type the name of the column to contain the new values
(such as C4) in the Store result in variable: box.

3. Click inside the Expression: box, click C2 Sales in the
column to the left of the Expression: box, and click Select.
Click on the exponentiation (**) button. Type 2 after the
two asterisks in the Expression: box. Click OK. (See
Screen 1.8.)

4. The numbers 1225, 1444, 2500, and 2304 should appear
in C4.

Calculating the Sum of a Column

1. To calculate the sum of the values in a column, select Calc > Column Statistics, which
will produce a dialog box. From the Statistic list, select Sum.

2. Click in the Input Variable: box. The list of variables will appear in the left portion of the
dialog box. Click on the variable you wish to sum, then click Select. (See Screen 1.9.)

Screen 1.9 Minitab Column Statistics dialog box, with Sum selected as the statistic and
C4 as the input variable.

Screen 1.10 Session window output: Sum of C4 = 7473.

3. Click OK. The result will appear in the Session window. (See Screen 1.10.)
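
For readers who want to check the Minitab results by other means, the two steps (creating a column of squares, then summing it) can be sketched in Python. This is an illustration rather than a Minitab procedure; the dictionary layout and the name worksheet are assumptions made for the example, while the data are those of Screen 1.7.

```python
# The worksheet of Screen 1.7, stored as one list per column.
worksheet = {
    "year": [2008, 2008, 2009, 2009],
    "sales": [35, 38, 50, 48],
    "employee": ["J. Smith", "A. Jones", "J. Smith", "A. Jones"],
}

# New column C4: the square of each value in the sales column.
worksheet["C4"] = [s ** 2 for s in worksheet["sales"]]

# Sum of the new column, that is, the sum of the squared sales values.
sum_c4 = sum(worksheet["C4"])

print("C4:", worksheet["C4"])
print("Sum of C4 =", sum_c4)
```

The values in C4 and their sum agree with the figures quoted in the instructions above.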

Entering and Saving Data in Excel

1. Start Excel.

2. Use the mouse or the arrow keys to select where you want to start entering your data.
Data can be numeric or text. The rectangles are called cells, and the cells are collectively
known as a spreadsheet.

3. You can format your data by selecting the cells that you want to format, then selecting
Format > Cells, and then choosing whether you want to format a number, align text, and
so forth. For common formatting tasks, you have icons on the toolbar, such as a dollar
sign ($) to format currency, a percent sign (%) to format numbers as percents, and icons
representing left-, center-, and right-aligned text to change your alignment.

4. If you need to revise an entry, go to that cell with the mouse or the arrow keys. You can
retype the entry or you can edit it. To edit it, double-click on the cell and use the arrow
keys and the backspace key to help you revise the entry, then press Enter to put the revised
entry into the cell.

5. When you are done, select File >Save As to save your work for the first time as a file on
your computer. Note that Excel will automatically assign the file extension .xls to your
work after you choose the filename.

6. Try entering the following data into Excel:

January     52    .08
February    48    .06
March       49    .07

Format it to look like this:

January     $52.00
February    $48.00
March       $49.00

Save the result as the file test.xls.

7. To retrieve the file, select File > Open and select the file test.xls.

Screen 1.11 contains the data from the Minitab example (Screen 1.7) as displayed in
Excel.

year    sales   employee
2008    35      J. Smith
2008    38      A. Jones
2009    50      J. Smith
2009    48      A. Jones

Screen 1.11

Screen 1.12 Excel worksheet with the formula =B1^2 being entered.

Creating New Columns from Existing Columns

Many times, such as when you need to calculate Σx² or Σxy, you will need to calculate a new
column of values using one or more existing columns. To calculate the squares of the values in
cells B1 to B3 and place them in cells D1 to D3:

1. Click on cell D1.

2. Type =B1^2. Press Enter. (See Screen 1.12.)

3. While still on cell D1, select Edit > Copy. Highlight cells D2 and D3.
Select Edit > Paste.

4. The numbers 2704, 2304, and 2401 should appear in D1 to D3.

Calculating the Sum of a Column

To calculate the sum of the values in a column, go to the
empty cell below the values you wish to find the sum of.
Click the sigma (Σ) button in the upper-right portion of the
Home tab. This will enter the Sum function into the cell
along with the list of cells involved in the sum. (Note: If the
list is incorrect, you can type any changes.) Press Enter. (See
Screen 1.13.)

Screen 1.13 Excel worksheet showing the formula =SUM(D1:D3) for the values 2704, 2304,
and 2401.

TECHNOLOGY ASSIGNMENTS

TA1.1 The following table gives the names, hours worked, and salary for the past week for five workers.

Name

Hours Worked

Salary ($)

John

925

Shannon

2583

Kathy

1255

David

2090

Steve

1020

a. Enter these data into the spreadsheet. Save the data file as WORKER. Exit the session or program.
Then restart the program or software and retrieve the file WORKER.

b. Print a hard copy of the spreadsheet containing data you entered.

TA1.2 Refer to data on 2007 charitable givings of six retailers given in Table 1.1. Enter those data into
the spreadsheet and save this file as GIVINGS.

Chapter 2

Organizing and Graphing Data

What future careers interest high school students the most? Information technology? Business/
management? Health care? Or some other careers? A sample survey of high school students showed
the percentage of students intending to enter each of these fields. In this sample of 1023 high school
students, 15% said that they planned to pursue a career in health care. (See Case Study 2-1)

In addition to thousands of private organizations and individuals, a large number of U.S. government
agencies (such as the Bureau of the Census, the Bureau of Labor Statistics, the National Agricultural
Statistics Service, the National Center for Education Statistics, the National Center for Health Statistics,
and the Bureau of Justice Statistics) conduct hundreds of surveys every year. The data collected from
each of these surveys fill hundreds of thousands of pages. In their original form, these data sets may
be so large that they do not make sense to most of us. Descriptive statistics, however, supplies the
techniques that help to condense large data sets by using tables, graphs, and summary measures. We
see such tables, graphs, and summary measures in newspapers and magazines every day. At a glance,
these tabular and graphical displays present information on every aspect of life. Consequently, descriptive
statistics is of immense importance because it provides efficient and effective methods for summa-
rizing and analyzing information.

This chapter explains how to organize and display data using tables and graphs. We will learn
how to prepare frequency distribution tables for qualitative and quantitative data; how to construct bar
graphs, pie charts, histograms, and polygons for such data; and how to prepare stem-and-leaf displays.

2.1 Raw Data
2.2 Organizing and Graphing Qualitative Data
Case Study 2-1 Career Choices for High School Students
Case Study 2-2 In or Out in 30 Minutes
2.3 Organizing and Graphing Quantitative Data
Case Study 2-3 Morning Grooming
2.4 Shapes of Histograms
2.5 Cumulative Frequency Distributions
2.6 Stem-and-Leaf Displays
2.7 Dotplots


2.1 Raw Data

When data are collected, the information obtained from each member of a population or sample
is recorded in the sequence in which it becomes available. This sequence of data recording is
random and unranked. Such data, before they are grouped or ranked, are called raw data.

Definition

Raw Data Data recorded in the sequence in which they are collected and before they are
processed or ranked are called raw data.

Suppose we collect information on the ages (in years) of 50 students selected from a
university. The data values, in the order they are collected, are recorded in Table 2.1. For instance,
the first student's age is 21, the second student's age is 19 (second number in the first row), and
so forth. The data in Table 2.1 are quantitative raw data.

Table 2.1 Ages of 50 Students

Suppose we ask the same 50 students about their student status. The responses of the students
are recorded in Table 2.2. In this table, F, SO, J, and SE are the abbreviations for freshman,
sophomore, junior, and senior, respectively. This is an example of qualitative (or categorical)
raw data.

Table 2.2 Status of 50 Students

The data presented in Tables 2.1 and 2.2 are also called ungrouped data. An ungrouped
data set contains information on each member of a sample or population individually.

2.2 Organizing and Graphing Qualitative Data

This section discusses how to organize and display qualitative (or categorical) data. Data sets
are organized into tables, and data are displayed using graphs.

2.2.1 Frequency Distributions

A sample of 100 students enrolled at a university were asked what they intended to do after
graduation. Forty-four said they wanted to work for private companies/businesses, 16 said
they wanted to work for the federal government, 23 wanted to work for state or local governments,
and 17 intended to start their own businesses. Table 2.3 lists the types of employment and the
number of students who intend to engage in each type of employment. In this table, the variable
is the type of employment, which is a qualitative variable. The categories (representing
the type of employment) listed in the first column are mutually exclusive. In other words,
each of the 100 students belongs to one and only one of these categories. The number of students
who belong to a certain category is called the frequency of that category. A frequency
distribution exhibits how the frequencies are distributed over various categories. Table 2.3
is called a frequency distribution table or simply a frequency table.

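Table 2.3 Type of Employment Students Intend to Engage In

Type of Employment              Number of Students
Private companies/businesses            44
Federal government                      16
State/local government                  23
Own business                            17
                                  Sum = 100

In Table 2.3, the heading Type of Employment names the variable, each row label is a category,
each count is a frequency, and the second column is the frequency column.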

Definition

Frequency Distribution for Qualitative Data A frequency distribution for qualitative data lists all
categories and the number of elements that belong to each of the categories.

Example 2-1 illustrates how a frequency distribution table is constructed for qualitative data.

EXAMPLE 2-1

A sample of 30 employees from large companies was selected, and these employees were
asked how stressful their jobs were. The responses of these employees are recorded below,
where very represents very stressful, somewhat means somewhat stressful, and none stands
for not stressful at all.

somewhat   very       somewhat   somewhat   none
somewhat   somewhat   very       very       somewhat
somewhat   none       somewhat   very       very
very       very       somewhat   somewhat   very
somewhat   none       very       none       none
somewhat   somewhat   none       somewhat

Construct a frequency distribution table for these data.

Solution Note that the variable in this example is how stressful an employee's job is. This
variable is classified into three categories: very stressful, somewhat stressful, and not stressful at
all. We record these categories in the first column of Table 2.4. Then we read each employee's
response from the given data and mark a tally, denoted by the symbol |, in the second column
of Table 2.4 next to the corresponding category. For example, the first employee's response is
that his or her job is somewhat stressful. We show this in the frequency table by marking a tally
in the second column next to the category somewhat. Note that the tallies are marked in blocks
of five for counting convenience. Finally, we record the total of the tallies for each category in
the third column of the table. This column is called the column of frequencies and is usually
denoted by f. The sum of the entries in the frequency column gives the sample size or total
frequency. In Table 2.4, this total is 30, which is the sample size.
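
The tally step described above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the short list `sample_responses` are hypothetical and are not the 30 responses of Example 2-1.

```python
from collections import Counter


def frequency_table(responses, categories):
    """Count how many responses fall in each category (the tally step)."""
    counts = Counter(responses)
    # List every category in the given order, even one that never occurs.
    return {category: counts.get(category, 0) for category in categories}


# Hypothetical short list of responses, used only to show the mechanics.
sample_responses = ["somewhat", "very", "somewhat", "none", "very", "somewhat"]
table = frequency_table(sample_responses, ["very", "somewhat", "none"])

for category, f in table.items():
    print(f"{category:<10}{f}")
# The sum of the frequency column equals the sample size.
print(f"{'Sum':<10}{sum(table.values())}")
```

Because each response is counted under exactly one category, the sum of the frequencies always equals the number of responses read.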

Table 2.4 Frequency Distribution of Stress on Job

Stress on Job    Tally                  Frequency (f)
Very             ||||| |||||                 10
Somewhat         ||||| ||||| ||||            14
None             ||||| |                      6
                                        Sum = 30

2.2.2 Relative Frequency and Percentage Distributions

The relative frequency of a category is obtained by dividing the frequency of that category by
the sum of all frequencies. Thus, the relative frequency shows what fractional part or propor-
tion of the total frequency belongs to the corresponding category. A relative frequency distri-
bution lists the relative frequencies for all categories.

Calculating Relative Frequency of a Category

$$\text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}$$

The percentage for a category is obtained by multiplying the relative frequency of that cat-
egory by 100. A percentage distribution lists the percentages for all categories.

Calculating Percentage

Percentage = (Relative frequency) × 100

EXAMPLE 2-2

Determine the relative frequency and percentage distributions for the data of Table 2.4.

Solution The relative frequencies and percentages from Table 2.4 are calculated and listed
in Table 2.5. Based on this table, we can state that .333, or 33.3%, of the employees said that
their jobs are very stressful. By adding the percentages for the first two categories, we can
state that 80% of the employees said that their jobs are very or somewhat stressful. The other
numbers in Table 2.5 can be interpreted the same way.

Notice that the sum of the relative frequencies is always 1.00 (or approximately 1.00 if the
relative frequencies are rounded), and the sum of the percentages is always 100 (or
approximately 100 if the percentages are rounded).

Table 2.5 Relative Frequency and Percentage Distributions of Stress on Job

Stress on Job    Relative Frequency    Percentage
Very             10/30 = .333          .333(100) = 33.3
Somewhat         14/30 = .467          .467(100) = 46.7
None              6/30 = .200          .200(100) = 20.0
                 Sum = 1.000           Sum = 100
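
As a rough check of Table 2.5, the same arithmetic can be written in Python. The frequencies come from Table 2.4; the variable names are illustrative assumptions.

```python
# Frequencies (f column) from Table 2.4.
frequencies = {"Very": 10, "Somewhat": 14, "None": 6}
total = sum(frequencies.values())  # sum of all frequencies (30)

# Relative frequency = frequency of the category / sum of all frequencies.
relative = {c: f / total for c, f in frequencies.items()}
# Percentage = relative frequency * 100.
percentage = {c: 100 * r for c, r in relative.items()}

for c in frequencies:
    print(f"{c:<10}{relative[c]:.3f}{percentage[c]:8.1f}")
```

Before rounding, the relative frequencies sum to exactly 1 and the percentages to exactly 100; any small deviation in a printed table comes from rounding the individual entries.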

Case Study 2-1 Career Choices for High School Students

[Chart: USA TODAY Snapshots, "Career choices for high school students," a horizontal bar graph of
percentages from a survey of high school students 14 to 18 years old. By Jae Yang and Sam Ward,
USA TODAY.]

The accompanying chart shows a bar graph indicating the career choices of high school students. Note that
in this chart, the bars are drawn horizontally. The chart shows the percentage of high school students in a
sample of 1023 who plan to go into different careers. Note that the percentages in the chart add up to 72%.
Therefore, 28% of these students mentioned other careers.

Source: USA Today, April 16, 2009. Copyright © 2009, USA Today. Chart reproduced with permission.

2.2.3 Graphical Presentation of Qualitative Data

All of us have heard the adage "a picture is worth a thousand words." A graphic display can re-
veal at a glance the main characteristics of a data set. The bar graph and the pie chart are two
types of graphs that are commonly used to display qualitative data.

Bar Graphs

To construct a bar graph (also called a bar chart), we mark the various categories on the hor-
izontal axis as in Figure 2.1. Note that all categories are represented by intervals of the same
width. We mark the frequencies on the vertical axis. Then we draw one bar for each category
such that the height of the bar represents the frequency of the corresponding category. We leave
a small gap between adjacent bars. Figure 2.1 gives the bar graph for the frequency distribution
of Table 2.4.

Definition

Bar Graph A graph made of bars whose heights represent the frequencies of respective categories
is called a bar graph.

The bar graphs for relative frequency and percentage distributions can be drawn simply
by marking the relative frequencies or percentages, instead of the frequencies, on the verti-
cal axis.

Sometimes a bar graph is constructed by marking the categories on the vertical axis and
the frequencies on the horizontal axis. Case Study 2-1 presents such an example.

Figure 2.1 Bar graph for the frequency distribution of Table 2.4 (horizontal axis: stress on job;
vertical axis: frequency).

Case Study 2-2 In or Out in 30 Minutes

[Chart: USA TODAY Snapshots, "In or out in 30 minutes," a pie chart. Question: How long do you take
to make your decision on candidates after interviews? Less than 30 minutes: 47%; 30 minutes to an
hour: 34%; more than an hour: 19%. Source: Development Dimensions International (DDI) survey of
1,910 interviewers. Margin of error ±3 percentage points. By Jae Yang and Alejandro Gonzalez,
USA TODAY.]

The accompanying chart shows a pie chart indicating the time it takes interviewers to make decisions after
they interview candidates. For example, 47% of the interviewers said that they decide within 30 minutes
after interviewing a candidate. The pie chart is based on a sample survey of 1910 interviewers.

Source: USA Today, March 19. Chart reproduced with permission.

Pie Charts

A pie chart is more commonly used to display percentages, although it can be used to display
frequencies or relative frequencies. The whole pie (or circle) represents the total sample or popu-
lation. Then we divide the pie into different portions that represent the different categories.

Definition

Pie Chart A circle divided into portions that represent the relative frequencies or percentages
of a population or a sample belonging to different categories is called a pie chart.

As we know, a circle contains 360 degrees. To construct a pie chart, we multiply 360 by
the relative frequency of each category to obtain the degree measure or size of the angle for the
corresponding category. Table 2.6 shows the calculation of angle sizes for the various categories
of Table 2.5.

Table 2.6 Calculating Angle Sizes for the Pie Chart

Stress on Job    Relative Frequency    Angle Size
Very             .333                  360(.333) = 119.88
Somewhat         .467                  360(.467) = 168.12
None             .200                  360(.200) =  72.00
                 Sum = 1.000           Sum = 360

Figure 2.2 shows the pie chart for the percentage distribution of Table 2.5, which uses the
angle sizes calculated in Table 2.6.
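
The angle calculation of Table 2.6 can be reproduced with a short sketch. The relative frequencies are the rounded values from Table 2.5; the dictionary names are illustrative.

```python
# Rounded relative frequencies from Table 2.5.
relative_frequency = {"Very": 0.333, "Somewhat": 0.467, "None": 0.200}

# Angle for a category = 360 degrees * relative frequency of that category.
angles = {c: round(360 * r, 2) for c, r in relative_frequency.items()}

for c, angle in angles.items():
    print(f"{c:<10}{angle:7.2f} degrees")
print(f"{'Sum':<10}{sum(angles.values()):7.2f} degrees")
```

The angles add up to 360 here only because the three rounded relative frequencies happen to sum to exactly 1.000; with other data, rounding can leave the total slightly off 360.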

Figure 2.2 Pie chart for the percentage distribution
of Table 2.5.

EXERCISES

CONCEPTS AND PROCEDURES

2.1 Why do we need to group data in the form of a frequency table? Explain briefly.

2.2 How are the relative frequencies and percentages of categories obtained from the frequencies of
categories? Illustrate with the help of an example.

2.3 The following data give the results of a sample survey. The letters A, B, and C represent the three
categories.

a. Prepare a frequency distribution table.

b. Calculate the relative frequencies and percentages for all categories.

c. What percentage of the elements in this sample belong to category B?

d. What percentage of the elements in this sample belong to category A or C?

e. Draw a bar graph for the frequency distribution.

2.4 The following data give the results of a sample survey. The letters Y, N, and D represent the three
categories.

a. Prepare a frequency distribution table.

b. Calculate the relative frequencies and percentages for all categories.

c. What percentage of the elements in this sample belong to category Y?

d. What percentage of the elements in this sample belong to category N or D?

e. Draw a pie chart for the percentage distribution.

APPLICATIONS

2.5 The data on the status of 50 students given in Table 2.2 of Section 2.1 are reproduced here.

a. Prepare a frequency distribution table.

b. Calculate the relative frequencies and percentages for all categories.

c. What percentage of these students are juniors or seniors?

d. Draw a bar graph for the frequency distribution.

2.6 Thirty adults were asked which of the following conveniences they would find most difficult to do
without: television (T), refrigerator (R), air conditioning (A), public transportation (P), or microwave (M).
Their responses are listed below.

a. Prepare a frequency distribution table.

b. Calculate the relative frequencies and percentages for all categories.

c. What percentage of these adults named refrigerator or air conditioning as the convenience that they
would find most difficult to do without?

d. Draw a bar graph for the relative frequency distribution.

2.7 In a USA TODAY survey, registered dietitians with the American Dietetic Association were asked,
"What is the major reason people want to lose weight?" The responses were classified as Health (H),
Cosmetic (C), and Other (O). Suppose a random sample of 20 dietitians is taken and these dietitians are
asked the same question. Their responses are as follows.

H H C H O C C H C O
O H C H H C H H O H

a. Prepare a frequency distribution table.

b. Compute the relative frequencies and percentages for all categories.

c. What percentage of these dietitians gave Health as the major reason for people to lose weight?

d. Draw a pie chart for the percentage distribution.

2.8 The following data show the method of payment by 16 customers in a supermarket checkout line.
Here, C refers to cash, CK to check, CC to credit card, and D to debit card, and O stands for other.

C CK CK C CC D O C

CK CC D CC C CK CK CC

a. Construct a frequency distribution table.

b. Calculate the relative frequencies and percentages for all categories.

c. Draw a pie chart for the percentage distribution.

2.9 In a January 27, 2009 Harris Poll (Harris Interactive Inc., January 2009), U.S. adults who follow at
least one sport were asked to name their favorite sport. The table below summarizes their responses.

Favorite Sport

Percentage of Responses

Pro football

Baseball

College football

Auto racing

Men's pro basketball

Hockey

Men's college basketball

Note that these percentages add up to 83%. The remaining respondents named other sports, which can be
denoted by Other. Draw a pie chart for this distribution.

2.10 In exit polls taken during the 2008 presidential election, voters were asked to provide their educa-
tion levels. The table below summarizes their responses.

Education Level

Percentage of Responses

Not a high school graduate

High school graduate

Some college education

College graduate

Post graduate education

Source: New York Times, November 5, 2008.

Draw a bar graph to display these data.

2.3 Organizing and Graphing Quantitative Data

In the previous section we learned how to group and display qualitative data. This section ex-
plains how to group and display quantitative data.

2.3.1 Frequency Distributions

Table 2.7 gives the weekly earnings of 100 employees of a large company. The first column
lists the classes, which represent the (quantitative) variable weekly earnings. For quantitative
data, an interval that includes all the values that fall within two numbers — the lower and upper
limits — is called a class. Note that the classes always represent a variable. As we can observe,
the classes are nonoverlapping; that is, each value on earnings belongs to one and only one
class. The second column in the table lists the number of employees who have earnings within
each class. For example, 9 employees of this company earn $801 to $1000 per week. The num-
bers listed in the second column are called the frequencies, which give the number of values
that belong to different classes. The frequencies are denoted by f.

Table 2.7 Weekly Earnings of 100 Employees of a Company

Weekly Earnings (dollars)    Number of Employees (f)
801 to 1000                  9
1001 to 1200
1201 to 1400
1401 to 1600
1601 to 1800
1801 to 2000

In Table 2.7, the heading Weekly Earnings names the variable, 1201 to 1400 is the third class and
the entry beside it is the frequency of the third class, 1801 is the lower limit and 2000 is the
upper limit of the sixth class, and the second column is the frequency column.

For quantitative data, the frequency of a class represents the number of values in the data
set that fall in that class. Table 2.7 contains six classes. Each class has a lower limit and an
upper limit. The values 801, 1001, 1201, 1401, 1601, and 1801 give the lower limits, and the
values 1000, 1200, 1400, 1600, 1800, and 2000 are the upper limits of the six classes, respec-
tively. The data presented in Table 2.7 are an illustration of a frequency distribution table for
quantitative data. Whereas the data that list individual values are called ungrouped data, the data
presented in a frequency distribution table are called grouped data.

Definition

Frequency Distribution for Quantitative Data A frequency distribution for quantitative data lists
all the classes and the number of values that belong to each class. Data presented in the form of
a frequency distribution are called grouped data.

To find the midpoint of the upper limit of the first class and the lower limit of the second
class in Table 2.7, we divide the sum of these two limits by 2. Thus, this midpoint is

$$\frac{1000 + 1001}{2} = 1000.5$$

The value 1000.5 is called the upper boundary of the first class and the lower boundary of the
second class. By using this technique, we can convert the class limits of Table 2.7 to class
boundaries, which are also called real class limits. The second column of Table 2.8 lists the
boundaries for Table 2.7.

Definition

Class Boundary The class boundary is given by the midpoint of the upper limit of one class and
the lower limit of the next class.

The difference between the two boundaries of a class gives the class width. The class width
is also called the class size.

Finding Class Width

Class width = Upper boundary - Lower boundary

Thus, in Table 2.8,

Width of the first class = 1000.5 - 800.5 = 200

The class widths for the frequency distribution of Table 2.7 are listed in the third column of
Table 2.8. Each class in Table 2.8 (and Table 2.7) has the same width of 200.

The class midpoint or mark is obtained by dividing the sum of the two limits (or the two
boundaries) of a class by 2.

Calculating Class Midpoint or Mark

$$\text{Class midpoint or mark} = \frac{\text{Lower limit} + \text{Upper limit}}{2}$$

Thus, the midpoint of the first class in Table 2.7 or Table 2.8 is calculated as follows:

$$\text{Midpoint of the first class} = \frac{801 + 1000}{2} = 900.5$$

The class midpoints for the frequency distribution of Table 2.7 are listed in the fourth column
of Table 2.8.

Table 2.8 Class Boundaries, Class Widths, and Class Midpoints for Table 2.7

Class Limits    Class Boundaries              Class Width    Class Midpoint
801 to 1000     800.5 to less than 1000.5     200            900.5
1001 to 1200    1000.5 to less than 1200.5    200            1100.5
1201 to 1400    1200.5 to less than 1400.5    200            1300.5
1401 to 1600    1400.5 to less than 1600.5    200            1500.5
1601 to 1800    1600.5 to less than 1800.5    200            1700.5
1801 to 2000    1800.5 to less than 2000.5    200            1900.5

Note that in Table 2.8, when we write classes using class boundaries, we write "to less
than" to ensure that each value belongs to one and only one class. As we can see, the upper
boundary of the preceding class and the lower boundary of the succeeding class are the same.
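
The conversion from class limits to boundaries, widths, and midpoints can be sketched as follows. The function name `describe_classes` is hypothetical, and the sketch assumes a constant gap between the upper limit of one class and the lower limit of the next, and that the outermost boundaries extend by half that gap, as they do in Table 2.8.

```python
def describe_classes(limits):
    """Return (lower boundary, upper boundary, width, midpoint) per class.

    `limits` is a list of (lower limit, upper limit) pairs. The gap between
    one class's upper limit and the next class's lower limit is assumed to be
    the same throughout (1 in Table 2.7).
    """
    gap = limits[1][0] - limits[0][1]          # e.g. 1001 - 1000 = 1
    rows = []
    for lower, upper in limits:
        lower_boundary = lower - gap / 2       # midpoint with the previous class
        upper_boundary = upper + gap / 2       # midpoint with the next class
        width = upper_boundary - lower_boundary
        midpoint = (lower + upper) / 2
        rows.append((lower_boundary, upper_boundary, width, midpoint))
    return rows


table_2_7_limits = [(801, 1000), (1001, 1200), (1201, 1400),
                    (1401, 1600), (1601, 1800), (1801, 2000)]
rows = describe_classes(table_2_7_limits)
for row in rows:
    print(row)
```

The midpoint is the same whether it is computed from the two limits or from the two boundaries, since the boundaries move outward by equal amounts.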

2.3.2 Constructing Frequency Distribution Tables

When constructing a frequency distribution table, we need to make the following three major
decisions.

Number of Classes

Usually the number of classes for a frequency distribution table varies from 5 to 20, depending
mainly on the number of observations in the data set.¹ It is preferable to have more classes as
the size of a data set increases. The decision about the number of classes is arbitrarily made by
the data organizer.

Class Width

Although it is not uncommon to have classes of different sizes, most of the time it is preferable
to have the same width for all classes. To determine the class width when all classes are the same
size, first find the difference between the largest and the smallest values in the data. Then, the ap-
proximate width of a class is obtained by dividing this difference by the number of desired classes.

Calculation of Class Width

$$\text{Approximate class width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$$

Usually this approximate class width is rounded to a convenient number, which is then used
as the class width. Note that rounding this number may slightly change the number of classes
initially intended.

Lower Limit of the First Class or the Starting Point

Any convenient number that is equal to or less than the smallest value in the data set can be
used as the lower limit of the first class.

Example 2-3 illustrates the procedure for constructing a frequency distribution table for
quantitative data.

EXAMPLE 2-3

The following data give the total number of iPods® sold by a mail order company on each of
30 days. Construct a frequency distribution table.

Solution In these data, the minimum value is 5, and the maximum value is 29. Suppose we
decide to group these data using five classes of equal width. Then,

$$\text{Approximate width of each class} = \frac{29 - 5}{5} = 4.8$$

Now we round this approximate width to a convenient number, say 5. The lower limit of the
first class can be taken as 5 or any number less than 5. Suppose we take 5 as the lower limit
of the first class. Then our classes will be

5-9, 10-14, 15-19, 20-24, and 25-29

We record these five classes in the first column of Table 2.9.
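
The three decisions made so far in this example (number of classes, class width, and starting point) can be traced in a short sketch. The variable names are illustrative, the rounding of 4.8 to 5 is a choice made by hand rather than by the code, and the `- 1` in the upper limit assumes integer-valued data.

```python
smallest, largest = 5, 29        # minimum and maximum values in the data
number_of_classes = 5            # chosen by the data organizer

# Approximate width = (largest value - smallest value) / number of classes.
approximate_width = (largest - smallest) / number_of_classes

width = 5    # the approximate width, rounded by hand to a convenient number
start = 5    # lower limit of the first class (must not exceed the smallest value)

# For integer data, each class runs from its lower limit to one less than
# the lower limit of the next class.
classes = [(start + i * width, start + (i + 1) * width - 1)
           for i in range(number_of_classes)]

print(approximate_width)
print(classes)
```

A quick sanity check is that the first lower limit does not exceed the smallest value and the last upper limit is at least the largest value, so every observation falls in some class.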

¹One rule to help decide on the number of classes is Sturges' formula:

c = 1 + 3.3 log n

where c is the number of classes and n is the number of observations in the data set. The value of log n can be
obtained by using a calculator.
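
Sturges' formula is easy to evaluate in code. The sketch below assumes that log denotes the base-10 logarithm, which is what the coefficient 3.3 corresponds to; the function name `sturges_classes` is hypothetical.

```python
import math


def sturges_classes(n):
    """Suggested number of classes c = 1 + 3.3 log n (base-10 logarithm)."""
    return 1 + 3.3 * math.log10(n)


c = sturges_classes(30)          # 30 observations, as in Example 2-3
print(round(c, 2), round(c))     # the suggestion and its nearest whole number
```

The result is only a suggestion: for 30 observations it points to about six classes, while Example 2-3 chooses five, which is consistent with the remark that the final decision rests with the data organizer.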

Now we read each value from the given data and mark a tally in the second column of
Table 2.9 next to the corresponding class. The first value in our original data set is 8, which be-
longs to the 5-9 class. To record it, we mark a tally in the second column next to the 5-9 class.
We continue this process until all the data values have been read and entered in the tally column.
Note that tallies are marked in blocks of five for counting convenience. After the tally column is
completed, we count the tally marks for each class and write those numbers in the third column.
This gives the column of frequencies. These frequencies represent the number of days on which
the number of iPods sold fell in each class. For example, on 8 of 30 days, 15 to 19 iPods were sold.

Table 2.9 Frequency Distribution for the Data on iPods Sold

iPods Sold    Tally          f
5-9           |||            3
10-14         ||||| |        6
15-19         ||||| |||      8
20-24         ||||| |||      8
25-29         |||||          5
                        Σf = 30

In Table 2.9, we can denote the frequencies of the five classes by $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$,
respectively. Therefore,

$$f_1 = \text{Frequency of the first class} = 3$$

Similarly,

$$f_2 = 6, \quad f_3 = 8, \quad f_4 = 8, \quad \text{and} \quad f_5 = 5$$

Using the Σ notation (see Section 1.8 of Chapter 1), we can denote the sum of frequencies of
all classes by Σf. Hence,

$$\Sigma f = f_1 + f_2 + f_3 + f_4 + f_5 = 3 + 6 + 8 + 8 + 5 = 30$$

The number of observations in a sample is usually denoted by n. Thus, for the sample data,
Σf is equal to n. The number of observations in a population is denoted by N. Consequently,
Σf is equal to N for population data. Because the data set on the total iPods sold
in Table 2.9 is for only 30 days, it represents a sample. Therefore, in Table 2.9 we can denote
the sum of frequencies by n instead of Σf.

Note that when we present the data in the form of a frequency distribution table, as in
Table 2.9, we lose the information on individual observations. We cannot know the exact num-
bers of iPods sold on any given day from Table 2.9. All we know is that for 3 days, 5 to 9 iPods
were sold, and so forth.

2.3.3 Relative Frequency and Percentage Distributions

Using Table 2.9, we can compute the relative frequency and percentage distributions in the
same way as we did for qualitative data in Section 2.2.2. The relative frequencies and
percentages for a quantitative data set are obtained as follows.

Calculating Relative Frequency and Percentage

$$\text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Sum of all frequencies}} = \frac{f}{\Sigma f}$$

Percentage = (Relative frequency) × 100

Example 2-4 illustrates how to construct relative frequency and percentage distributions.

EXAMPLE 2-4

Calculate the relative frequencies and percentages for Table 2.9.

Solution The relative frequencies and percentages for the data in Table 2.9 are calculated
and listed in the third and fourth columns, respectively, of Table 2.10. Note that the class
boundaries are listed in the second column of Table 2.10.

Table 2.10 Relative Frequency and Percentage Distributions for Table 2.9

iPods Sold    Class Boundaries          Relative Frequency    Percentage
5-9           4.5 to less than 9.5      3/30 = .100           10.0
10-14         9.5 to less than 14.5     6/30 = .200           20.0
15-19         14.5 to less than 19.5    8/30 = .267           26.7
20-24         19.5 to less than 24.5    8/30 = .267           26.7
25-29         24.5 to less than 29.5    5/30 = .167           16.7
                                        Sum = 1.001           Sum = 100.1

Using Table 2.10, we can make statements about the percentage of days with iPods sold
within a certain interval. For example, on 20% of the days, 10 to 14 iPods were sold. By adding
the percentages for the first two classes, we can state that 5 to 14 iPods were sold on 30% of
the days. Similarly, by adding the percentages of the last two classes, we can state that 20 to
29 iPods were sold on 43.4% of the days.
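
The entries of Table 2.10, including the rounded sums of 1.001 and 100.1, can be reproduced with the sketch below. The frequencies are those of Table 2.9; rounding each entry to three decimals before summing is an assumption made to mirror the printed table.

```python
classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]           # f column of Table 2.9
n = sum(frequencies)                    # sample size, 30

# Round each relative frequency to three decimals, as in Table 2.10.
relative = [round(f / n, 3) for f in frequencies]
percentage = [round(100 * r, 1) for r in relative]

for label, r, p in zip(classes, relative, percentage):
    print(f"{label:<8}{r:.3f}{p:8.1f}")

# Sums of the rounded entries differ slightly from 1 and 100.
print(round(sum(relative), 3), round(sum(percentage), 1))

# Percentage of days on which 20 to 29 iPods were sold (last two classes).
print(round(percentage[3] + percentage[4], 1))
```

The sums exceed 1 and 100 only because the individual entries were rounded up; the unrounded relative frequencies still add to exactly 1.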

2.3.4 Graphing Grouped Data

Grouped (quantitative) data can be displayed in a histogram or a polygon. This section de-
scribes how to construct such graphs. We can also draw a pie chart to display the percentage
distribution for a quantitative data set. The procedure to construct a pie chart is similar to the
one for qualitative data explained in Section 2.2.3; it will not be repeated in this section.

Histograms

A histogram can be drawn for a frequency distribution, a relative frequency distribution, or
a percentage distribution. To draw a histogram, we first mark classes on the horizontal axis
and frequencies (or relative frequencies or percentages) on the vertical axis. Next, we draw
a bar for each class so that its height represents the frequency of that class. The bars in a
histogram are drawn adjacent to each other with no gap between them. A histogram is called
a frequency histogram, a relative frequency histogram, or a percentage histogram
depending on whether frequencies, relative frequencies, or percentages are marked on the
vertical axis.

Definition

Histogram A histogram is a graph in which classes are marked on the horizontal axis and the
frequencies, relative frequencies, or percentages are marked on the vertical axis. The frequen-
cies, relative frequencies, or percentages are represented by the heights of the bars. In a his-
togram, the bars are drawn adjacent to each other.

Figures 2.3 and 2.4 show the frequency and the relative frequency histograms, respectively,
for the data of Tables 2.9 and 2.10 of Sections 2.3.2 and 2.3.3. The two histograms look alike
because they represent the same data. A percentage histogram can be drawn for the percentage
distribution of Table 2.10 by marking the percentages on the vertical axis.
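
For a quick look at the shape of a frequency histogram without any plotting software, the bars can be printed as rows of symbols. This is a rough text-only sketch, not the figure in the book: the bars run sideways rather than upward, and the function name `text_histogram` is hypothetical.

```python
classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]   # f column of Table 2.9


def text_histogram(labels, counts, symbol="*"):
    """Return one line per class; the bar length equals the class frequency."""
    label_width = max(len(label) for label in labels)
    return [f"{label:>{label_width}} | {symbol * count}"
            for label, count in zip(labels, counts)]


lines = text_histogram(classes, frequencies)
print("\n".join(lines))
```

Consecutive rows touch each other with no blank row between them, which mirrors the rule that histogram bars are drawn adjacent to each other.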

Case Study 2-3 Morning Grooming

[Chart: USA TODAY Snapshots, "Morning grooming," a bar chart of percentages. Question: How much time
do you spend on hygiene/grooming in the morning (including showering, washing face and hands,
brushing teeth, shaving, applying makeup)? Classes, in minutes: 0-5, 6-10, 11-30, 31-60, and more
than 60. Source: SCA "Hygiene Matters" survey of 1,453 adults. By Michelle Healy and Web Bryant,
USA TODAY.]

Source: USA Today, February 24, 2009. Copyright © 2009, USA Today. Chart reproduced with permission.

The accompanying chart gives the bar chart for the percentage distribution of time that adults spend on
hygiene/grooming (such as showering, washing face and hands, brushing teeth, shaving, applying makeup) in the
morning. The results are based on a survey of 1453 adults. For example, 5% of adults included in the sample
said that they spend 0 to 5 minutes on such activities in the morning. Note that the percentages add up to 101%
due to rounding. Also note that all the classes have different widths. The last class (more than 60) is called an
open-ended class. We know that it has a lower limit of more than 60, but it does not have an upper limit.

In Figures 2.3 and 2.4, we have used class limits to mark classes on the horizontal axis.
However, we can show the intervals on the horizontal axis by using the class boundaries instead
of the class limits.

Figure 2.3 Frequency histogram for Table 2.9 (horizontal axis: iPods sold, classes 5-9, 10-14,
15-19, 20-24, 25-29; vertical axis: frequency).

Figure 2.4 Relative frequency histogram for Table 2.10 (horizontal axis: iPods sold, same classes;
vertical axis: relative frequency).

Polygons

A polygon is another device that can be used to present quantitative data in graphic form.
To draw a frequency polygon, we first mark a dot above the midpoint of each class at a
height equal to the frequency of that class. This is the same as marking the midpoint at the
top of each bar in a histogram. Next we mark two more classes, one at each end, and mark
their midpoints. Note that these two classes have zero frequencies. In the last step, we join
the adjacent dots with straight lines. The resulting line graph is called a frequency polygon
or simply a polygon.

A polygon with relative frequencies marked on the vertical axis is called a relative fre-
quency polygon. Similarly, a polygon with percentages marked on the vertical axis is called a
percentage polygon.

Definition

Polygon A graph formed by joining the midpoints of the tops of successive bars in a histogram
with straight lines is called a polygon.

Figure 2.5 shows the frequency polygon for the frequency distribution of Table 2.9.

Figure 2.5 Frequency polygon for Table 2.9 (horizontal axis: classes 5-9, 10-14, 15-19, 20-24, 25-29).
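
The points that are joined to form the frequency polygon for Table 2.9 can be computed as below. This is a sketch under the assumption that the two extra zero-frequency classes have the same width as the others; the variable names are illustrative.

```python
limits = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]   # classes of Table 2.9
frequencies = [3, 6, 8, 8, 5]

width = limits[1][0] - limits[0][0]                 # class width, 5
midpoints = [(lower + upper) / 2 for lower, upper in limits]

# Add one zero-frequency class at each end, then pair midpoints with heights.
points = ([(midpoints[0] - width, 0)]
          + list(zip(midpoints, frequencies))
          + [(midpoints[-1] + width, 0)])

print(points)
```

Joining consecutive points with straight lines gives the polygon; the two added end points bring the graph down to the horizontal axis on both sides.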

For a very large data set, as the number of classes is increased (and the width of classes is
decreased), the frequency polygon eventually becomes a smooth curve. Such a curve is called
a frequency distribution curve or simply a frequency curve. Figure 2.6 shows the frequency
curve for a large data set with a large number of classes.

Figure 2.6 Frequency distribution curve (vertical axis: frequency).

2.3.5 More on Classes and Frequency Distributions

This section presents two alternative methods for writing classes to construct a frequency dis-
tribution for quantitative data.

Less-Than Method for Writing Classes

The classes in the frequency distribution given in Table 2.9 for the data on iPods sold were
written as 5-9, 10-14, and so on. Alternatively, we can write the classes in a frequency distribution
table using the less-than method. The technique for writing classes shown in Table 2.9 is more
commonly used for data sets that do not contain fractional values. The less-than method is more
appropriate when a data set contains fractional values. Example 2-5 illustrates the less-than
method.

EXAMPLE 2-5

On April 1, 2009, the federal tax on a pack of cigarettes was increased from 39¢ to $1.0066,
a move that not only was expected to help increase federal revenue, but was also expected to
save about 900,000 lives (Time Magazine, April 2009). Table 2.11 shows the total tax (state
plus federal) per pack of cigarettes for all 50 states as of April 1, 2009.

Table 2.11 Total Tax per Pack of Cigarettes

Total Tax (in dollars)

1.43  2.71  3.01  1.65  3.01  1.81  2.16  2.34  1.88  3.58
1.85  1.92  3.01  3.76  2.16  1.36  1.35  1.45  1.38  2.26
3.01  2.04  1.58  2.19  1.99  2.36  2.00  3.47  2.37  1.08
1.80  2.54  1.61  1.63  1.37  2.42  3.01  1.70  3.01  3.00
3.52  1.31  3.01  3.03  2.51  1.56  1.19  2.78  1.18  1.61

Source: Campaign for Tobacco-Free Kids.

Construct a frequency distribution table. Calculate the relative frequencies and percentages for
all classes.

Solution The minimum value in this data set on cigarette taxes given in Table 2.11 is 1.08
and the maximum value is 3.76. Suppose we decide to group these data using six classes of
equal width. Then

$$\text{Approximate width of a class} = \frac{3.76 - 1.08}{6} = .45$$

We round this number to a more convenient number, say .50. Then we take .50 as the width
of each class. We can take a lower limit of the first class equal to 1.08 or any number lower
than 1.08. If we start the first class at 1, the classes will be written as 1 to less than 1.5, 1.5
to less than 2.00, and so on. The six classes, which cover all the data values, are recorded in
the first column of Table 2.12. The second column lists the frequencies of these classes. A
value in the data set that is 1 or larger but less than 1.5 belongs to the first class, a value that
is 1.50 or larger but less than 2.00 falls in the second class, and so on. The relative frequen-
cies and percentages for classes are recorded in the third and fourth columns, respectively, of
Table 2.12. Note that this table does not contain a column of tallies.

Table 2.12 Frequency, Relative Frequency, and Percentage Distributions of the Total
Tax on a Pack of Cigarettes

Total Tax (in dollars)     f          Relative Frequency    Percentage
1.00 to less than 1.50    10          .20                   20
1.50 to less than 2.00    13          .26                   26
2.00 to less than 2.50    10          .20                   20
2.50 to less than 3.00     4          .08                    8
3.00 to less than 3.50    10          .20                   20
3.50 to less than 4.00     3          .06                    6
                          Σf = 50     Sum = 1.00            Sum = 100

A histogram and a polygon for the data of Table 2.12 can be drawn the same way as for
the data of Tables 2.9 and 2.10.
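
The less-than method of Example 2-5 can also be carried out in a few lines of code. The following is a minimal sketch in Python, not part of the original example: the helper name `less_than_frequencies` is hypothetical, and the conversion of each value to whole cents is an assumption made so that a value such as 2.00 falls exactly on a class edge.

```python
# Total tax per pack (in dollars): the 50 values of Table 2.11.
taxes = [
    1.43, 2.71, 3.01, 1.65, 3.01, 1.81, 2.16, 2.34, 1.88, 3.58,
    1.85, 1.92, 3.01, 3.76, 2.16, 1.36, 1.35, 1.45, 1.38, 2.26,
    3.01, 2.04, 1.58, 2.19, 1.99, 2.36, 2.00, 3.47, 2.37, 1.08,
    1.80, 2.54, 1.61, 1.63, 1.37, 2.42, 3.01, 1.70, 3.01, 3.00,
    3.52, 1.31, 3.01, 3.03, 2.51, 1.56, 1.19, 2.78, 1.18, 1.61,
]


def less_than_frequencies(data, start, width, num_classes):
    """Count the values in each class 'lower to less than lower + width'.

    Hypothetical helper: values are converted to integer cents so that
    a value on a class edge is assigned to the class it starts.
    """
    start_c, width_c = round(start * 100), round(width * 100)
    counts = [0] * num_classes
    for value in data:
        index = (round(value * 100) - start_c) // width_c
        if not 0 <= index < num_classes:
            raise ValueError(f"{value} falls outside the classes")
        counts[index] += 1
    return counts


# (3.76 - 1.08)/6 is about .45, which the example rounds to .50.
approx_width = (max(taxes) - min(taxes)) / 6
freqs = less_than_frequencies(taxes, start=1.00, width=0.50, num_classes=6)
n = sum(freqs)

# Print class, frequency, relative frequency, and percentage.
for i, f in enumerate(freqs):
    lower = 1.00 + 0.50 * i
    print(f"{lower:.2f} to less than {lower + 0.50:.2f}  {f:>2}  {f / n:.2f}  {100 * f / n:.0f}")
```

The relative frequencies printed by this sketch agree with the third column of Table 2.12.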

Single-Valued Classes

If the observations in a data set assume only a few distinct (integer) values, it may be ap-
propriate to prepare a frequency distribution table using single-valued classes — that is,
classes that are made of single values and not of intervals. This technique is especially use-
ful in cases of discrete data with only a few possible values. Example 2-6 exhibits such a
situation.

■ EXAMPLE 2-6

The administration in a large city wanted to know the distribution of vehicles owned by house-
holds in that city. A sample of 40 randomly selected households from this city produced the
following data on the number of vehicles owned.

5
1
2
4

2
1

2
2
1

4
1

Construct a frequency distribution table for these data using single-valued classes.

Solution The observations in this data set assume only six distinct values: 0, 1, 2, 3, 4,
and 5. Each of these six values is used as a class in the frequency distribution in Table 2.13,
and these six classes are listed in the first column of that table. To obtain the frequencies of
these classes, the observations in the data that belong to each class are counted, and the results
are recorded in the second column of Table 2.13. Thus, in these data, 2 households own no
vehicle, 18 own one vehicle each, 11 own two vehicles each, and so on.

Constructing a
frequency distribution using
single-valued classes.


Table 2.13 Frequency Distribution of Vehicles Owned

Vehicles Owned    Number of Households (f)
0                  2
1                 18
2                 11
3
4
5
                  Σf = 40

The data of Table 2.13 can also be displayed in a bar graph, as shown in Figure 2.7. To
construct a bar graph, we mark the classes, as intervals, on the horizontal axis with a little
gap between consecutive intervals. The bars represent the frequencies of respective classes.

The frequencies of Table 2.13 can be converted to relative frequencies and percentages the
same way as in Table 2.11. Then, a bar graph can be constructed to display the relative frequency
or percentage distribution by marking the relative frequencies or percentages, respectively, on
the vertical axis.

[Bar graph: vehicles owned on the horizontal axis, frequency on the vertical axis.]

Figure 2.7 Bar graph for Table 2.13.
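
A frequency distribution with single-valued classes amounts to counting how often each distinct value occurs. The sketch below illustrates this in Python with `collections.Counter`. The list `vehicles` is hypothetical data for 10 households, made up for illustration; it is not the data of Example 2-6.

```python
from collections import Counter

# Hypothetical numbers of vehicles owned by 10 households (illustration only).
vehicles = [1, 0, 2, 1, 1, 3, 2, 1, 0, 2]

freq = Counter(vehicles)          # maps each distinct value to its frequency
n = len(vehicles)

# One row per single-valued class: value, f, relative frequency, percentage.
table = [(value, freq[value], freq[value] / n, 100 * freq[value] / n)
         for value in sorted(freq)]

for value, f, rel, pct in table:
    print(f"{value:>5}  {f:>3}  {rel:.2f}  {pct:.0f}")
```

Each distinct value serves as its own class, so no class width or class limits need to be chosen.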

2.4 Shapes of Histograms

A histogram can assume any one of a large number of shapes. The most common of these
shapes are

1. Symmetric

2. Skewed

3. Uniform or rectangular

A symmetric histogram is identical on both sides of its central point. The histograms
shown in Figure 2.8 are symmetric around the dashed lines that represent their central points.

Figure 2.8 Symmetric histograms.

A skewed histogram is nonsymmetric. For a skewed histogram, the tail on one side is
longer than the tail on the other side. A skewed-to-the-right histogram has a longer tail on the
right side (see Figure 2.9a). A skewed-to-the-left histogram has a longer tail on the left side
(see Figure 2.9b).

Figure 2.9 (a) A histogram skewed to the right. (b) A histogram skewed to the left.

A uniform or rectangular histogram has the same frequency for each class. Figure 2.10
is an illustration of such a case.

Figure 2.10 A histogram with uniform distribution.

Figures 2.11a and 2.11b display symmetric frequency curves. Figures 2.11c and 2.11d show
frequency curves skewed to the right and to the left, respectively.

Figure 2.11 (a), (b) Symmetric frequency curves. (c) Frequency curve skewed to
the right. (d) Frequency curve skewed to the left.


Warning ► Describing data using graphs gives us insights into the main characteristics of the data. But
graphs, unfortunately, can also be used, intentionally or unintentionally, to distort the facts and
deceive the reader. The following are two ways to manipulate graphs to convey a particular
opinion or impression.

1. Changing the scale either on one or on both axes — that is, shortening or stretching one or
both of the axes.

2. Truncating the frequency axis — that is, starting the frequency axis at a number greater than zero.

When interpreting a graph, we should be very cautious. We should observe carefully whether
the frequency axis has been truncated or whether any axis has been unnecessarily shortened or
stretched. See the Uses and Misuses section of this chapter for such an example.
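
The effect of truncating the frequency axis can be expressed with simple arithmetic: the drawn height of a bar is its frequency minus the value at which the axis starts. The short Python sketch below shows this under that assumption. The function name `apparent_ratio` and the two frequencies (40 and 30) are hypothetical and chosen only for illustration.

```python
def apparent_ratio(f_a, f_b, axis_start=0):
    """Ratio of the drawn bar heights when the frequency axis starts at axis_start."""
    return (f_a - axis_start) / (f_b - axis_start)


# Hypothetical frequencies of two classes.
true_ratio = apparent_ratio(40, 30)                      # axis starts at zero
truncated_ratio = apparent_ratio(40, 30, axis_start=25)  # axis starts at 25

print(round(true_ratio, 2), truncated_ratio)
```

With the axis starting at zero, the first bar is about one-third taller than the second. With the axis starting at 25, the first bar is drawn three times as tall as the second, although the frequencies themselves have not changed.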

EXERCISES

CONCEPTS AND PROCEDURES

2.11 Briefly explain the three decisions that have to be made to group a data set in the form of a fre-
quency distribution table.

2.12 How are the relative frequencies and percentages of classes obtained from the frequencies of classes?
Illustrate with the help of an example.

2.13 Three methods — writing classes using limits, using the less-than method, and grouping data using
single-valued classes — were discussed to group quantitative data into classes. Explain these three meth-
ods and give one example of each.

■ APPLICATIONS

2.14 A sample of 80 adults was taken, and these adults were asked about the number of credit cards they
possess. The following table gives the frequency distribution of their responses.

Number of Credit Cards    Number of Adults
0 to 3                    18
4 to 7                    26
8 to 11                   22
12 to 15                  11
16 to 19                   3

a. Find the class boundaries and class midpoints.

b. Do all classes have the same width? If so, what is this width?

c. Prepare the relative frequency and percentage distribution columns.

d. What percentage of these adults possess 8 or more credit cards?

2.15 The following table gives the frequency distribution of ages for all 50 employees of a company.

Age

Number of Employees

18 to 30

31 to 43

44 to 56

57 to 69

a. Find the class boundaries and class midpoints.

b. Do all classes have the same width? If yes, what is that width?

c. Prepare the relative frequency and percentage distribution columns.

d. What percentage of the employees of this company are age 43 or younger?

2.16 A data set on money spent on lottery tickets during the past year by 200 households has a lowest value
of $1 and a highest value of $1167. Suppose we want to group these data into six classes of equal widths.

a. Assuming that we take the lower limit of the first class as $1 and the width of each class equal
to $200, write the class limits for all six classes.

b. What are the class boundaries and class midpoints?


2.17 A data set on monthly expenditures (rounded to the nearest dollar) incurred on fast food by a sample
of 500 households has a minimum value of $3 and a maximum value of $147. Suppose we want to
group these data into six classes of equal widths.

a. Assuming that we take the lower limit of the first class as $1 and the upper limit of the sixth class
as $150, write the class limits for all six classes.

b. Determine the class boundaries and class widths.

c. Find the class midpoints.

2.18 The accompanying table lists the 2006-07 median household incomes (rounded to the nearest dollar),
for all 50 states and the District of Columbia.

State    2006-07 Median Household Income

40,620

42,963

60,506

49,342

47,598

53,912

39,452

65,652

56,311

65,249

59,209

42,760

64,158

49,267

54,257

42,219

D.C.

50,318

44,708

46,383

48,151

49,692

41,578

63,104

49,331

48,354

49,145

51,279

54,735

47,074

42,477

49,200

46,567

47,671

41,521

40,029

45,294

39,418

54,853

47,415

50,423

65,552

58,950

57,681

57,178

49,699

40,800

57,932

52,218

36,499

48,560

45,924

Source: U.S. Census Bureau.

a. Construct a frequency distribution table. Use the following classes: 36,000-40,999,
41,000-45,999, 46,000-50,999, 51,000-55,999, 56,000-60,999, 61,000-65,999.

b. Calculate the relative frequencies and percentages for all classes.

c. Based on the frequency distribution, can you say whether the data are symmetric or skewed?

d. What percentage of these states had a median household income of less than $56,000?

2.19 Nixon Corporation manufactures computer monitors. The following data are the numbers of com-
puter monitors produced at the company for a sample of 30 days.


a. Construct a frequency distribution table using the classes 21-23, 24-26, 27-29, 30-32, and 33-35.

b. Calculate the relative frequencies and percentages for all classes.

c. Construct a histogram and a polygon for the percentage distribution.

d. For what percentage of the days is the number of computer monitors produced in the interval
27-29?

2.20 The following data give the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days.

45 52 48 41 56 46 44 42 48 53 51 53 51
48 46 43 52 50 54 47 44 47 50 49 52

a. Make the frequency distribution table for these data.

b. Calculate the relative frequencies for all classes.

c. Construct a histogram for the relative frequency distribution.

d. Construct a polygon for the relative frequency distribution.

2.21 Since 1996, Slate.com has determined the Slate 60, which is a list of the largest American charita-
ble contributions by individuals each year. The accompanying table gives the names of the 22 persons and
the money they donated in 2008.

Donor                                      Donation (millions of dollars)
Harold Alfond                              360
Donald B. and Dorothy L. Stabler           334.2
David G. and Suzanne D. Booth              300
Frank C. Doble                             272
Robert L. and Catherine H. McDevitt        250
Michael R. Bloomberg                       235
Dorothy Clarke Patterson                   225
Richard W. Weiland                         174.3
Helen L. Kimmel                            156.5
Jeffrey S. Skoll                           144.1
H. F. (Gerry) and Marguerite B. Lenfest    139.9
David Rockefeller                          137.8
Stephen A. Schwarzman                      105
David H. Koch                              100
Gerhard R. Andlinger                       100
Eli and Edythe L. Broad                    100
Philip H. and Penelope Knight              100
Kenneth G. and Elaine A. Langone           100
Fritz J. and Dolores H. Russ               94.8
Frank Sr. and Jane Batten
Jesse H. and Beulah C. Cox                 83.5
Henry R. and Marie-Josee Kravis

Source: Slate.com, January 26, 2009.

a. Construct a frequency distribution table using the following classes: 75 to less than 125, 125 to
less than 175, 175 to less than 225, and so on.

b. Calculate the relative frequencies and percentages for all classes.

Exercises 2.22 through 2.26 are based on the following data.

The following table gives the age-adjusted cancer incidence rates (new cases) per 100,000 people for
three of the most common types of cancer contracted by both females and males: colon and rectum
cancer, lung and bronchus cancer, and non-Hodgkin lymphoma. The rates given are for 22 states west
of the Mississippi River for the years 2000 to 2004 (except for South Dakota, for which the data are
for 2001 to 2004), which are the most recent data available from the American Cancer Society.
Age-adjusted rates take into account the percentage of people in different age groups within each state's
population.

Colon and    Colon and    Lung and    Lung and     Non-Hodgkin    Non-Hodgkin
Rectum       Rectum       Bronchus    Bronchus     Lymphoma       Lymphoma
(Males)      (Females)    (Males)     (Females)    (Males)        (Females)

63.9         50.0          87.0       59.2         24.0           14.6
52.3         37.4          71.2       49.3         18.9           13.4
60.5         43.1         113.7       57.8         20.9           15.2
55.0         40.4          69.0       47.9         22.4           15.4
53.1         41.4          65.1       45.9         21.6           16.6
64.7         41.6          67.8       37.6         17.9           13.2
51.9         39.7          71.4       45.8         21.4           17.7
69.0         51.5          89.5       50.8         22.8           17.1
71.3         48.9         112.3       57.4         22.8           16.0
65.9         47.5         104.3       60.0         22.0           16.0
56.5         43.3          79.8       56.7         22.8           15.0
69.1         48.6          84.0       47.8         23.8           17.2
59.8         44.6          88.8       71.4         22.0           15.5
51.5         35.7          59.5       37.7         18.1           13.9
66.3         43.3          71.3       43.9         22.1           15.1
63.1         44.6         109.2       63.1         22.0           15.7
56.6         42.6          81.4       61.0         24.2           17.3
66.4         47.5          82.0       43.3         22.5           17.1
59.7         41.0          91.0       51.0         21.7           15.9
47.5         35.2          40.3       20.9         23.2           15.7
55.9         42.5          82.0       60.5         26.6           18.3
49.5         43.4          65.3       44.8         18.9           17.6

Source: American Cancer Society, 2008.

2.22 a. Prepare a frequency distribution table for colon and rectum cancer rates for women using six
classes of equal width.

b. Construct the relative frequency and percentage distribution columns.

2.23 a. Prepare a frequency distribution table for colon and rectum cancer rates for men using six classes
of equal width.

b. Construct the relative frequency and percentage distribution columns.

2.24 a. Prepare a frequency distribution table for lung and bronchus cancer rates for women.

b. Construct the relative frequency and percentage distribution columns.

c. Draw a histogram and polygon for the relative frequency distribution.

2.25 a. Prepare a frequency distribution table for lung and bronchus cancer rates for men.

b. Construct the relative frequency and percentage distribution columns.

c. Draw a histogram and polygon for the relative frequency distribution.

2.26 a. Prepare a frequency distribution table for non-Hodgkin lymphoma rates for women.

b. Construct the relative frequency and percentage distribution columns.

c. Draw a histogram and polygon for the relative frequency distribution.

2.27 The accompanying table lists the offensive points scored per game (PPG) by each of the 16 teams in
the American Football Conference (AFC) of the National Football League (NFL) during the 2008 season.

Team            PPG      Team             PPG
Baltimore       24.1     Kansas City      18.2
Buffalo         21.0     Miami            21.6
Cincinnati      12.8     New England      25.6
Cleveland       14.5     New York Jets    25.3
Denver          23.1     Oakland          16.4
Houston         22.9     Pittsburgh       21.7
Indianapolis    23.6     San Diego        27.4
Jacksonville    18.9     Tennessee        23.4

a. Construct a frequency distribution table. Take 12.0 as the lower boundary of the first class and
3.5 as the width of each class.

b. Prepare the relative frequency and percentage distribution columns for the frequency table of part a.

2.28 The following data give the number of turnovers (fumbles and interceptions) by a college football
team for each game in the past two seasons.

3 2 1 4 0 2 2 1 0 3 2 3
0 2 3 1 4 1 3 2 4 0 1 2

a. Prepare a frequency distribution table for these data using single-valued classes.

b. Calculate the relative frequencies and percentages for all classes.

c. In how many games did the team commit two or more turnovers?

d. Draw a bar graph for the frequency distribution of part a.

2.29 According to a survey by the U.S. Public Interest Research Group, about 79% of credit reports contain
errors. Suppose in a random sample of 25 credit reports, the number of errors found are as listed below.

1 0 2 3 0 1 0 5 4 1 0 2 1
4 1 2 2 0 3 1 0 0 1 2 3

a. Prepare a frequency distribution table for these data using single-valued classes.

b. Calculate the relative frequencies and percentages for all classes.

c. How many of these reports contained two or more errors?

d. Draw a bar graph for the frequency distribution of part a.

2.30 The following table gives the frequency distribution for the numbers of parking tickets received on
the campus of a university during the past week for 200 students.

Number of Tickets

Number of Students

Draw two bar graphs for these data, the first without truncating the frequency axis and the second by trun-
cating the frequency axis. In the second case, mark the frequencies on the vertical axis starting with 25.
Briefly comment on the two bar graphs.

2.31 Eighty adults were asked to watch a 30-minute infomercial until the presentation ended or until bore-
dom became intolerable. The following table lists the frequency distribution of the times that these adults
were able to watch the infomercial.

Time (minutes)        Number of Adults
0 to less than 6
6 to less than 12
12 to less than 18
18 to less than 24
24 to less than 30


Draw two histograms for these data, the first without truncating the frequency axis and the second by
truncating the frequency axis. In the second case, mark the frequencies on the vertical axis starting
with 10. Briefly comment on the two histograms.

2.5 Cumulative Frequency Distributions

Consider again Example 2-3 of Section 2.3.2 about the total number of iPods sold by a com-
pany. Suppose we want to know on how many days the company sold 19 or fewer iPods. Such
a question can be answered by using a cumulative frequency distribution. Each class in a cu-
mulative frequency distribution table gives the total number of values that fall below a certain
value. A cumulative frequency distribution is constructed for quantitative data only.

Definition

Cumulative Frequency Distribution A cumulative frequency distribution gives the total number
of values that fall below the upper boundary of each class.

In a cumulative frequency distribution table, each class has the same lower limit but a dif-
ferent upper limit. Example 2-7 illustrates the procedure for preparing a cumulative frequency
distribution.

■ EXAMPLE 2-7

Using the frequency distribution of Table 2.9, reproduced here, prepare a cumulative frequency
distribution for the number of iPods sold by that company.

iPods Sold    f
5-9           3
10-14         6
15-19         8
20-24         8
25-29         5

Solution Table 2.14 gives the cumulative frequency distribution for the number of iPods
sold. As we can observe, 5 (which is the lower limit of the first class in Table 2.9) is taken as
the lower limit of each class in Table 2.14. The upper limits of all classes in Table 2.14 are
the same as those in Table 2.9. To obtain the cumulative frequency of a class, we add the fre-
quency of that class in Table 2.9 to the frequencies of all preceding classes. The cumulative
frequencies are recorded in the third column of Table 2.14. The second column of this table
lists the class boundaries.

Table 2.14 Cumulative Frequency Distribution of iPods Sold

Class Limits    Class Boundaries         Cumulative Frequency
5-9             4.5 to less than 9.5     3
5-14            4.5 to less than 14.5    3 + 6 = 9
5-19            4.5 to less than 19.5    3 + 6 + 8 = 17
5-24            4.5 to less than 24.5    3 + 6 + 8 + 8 = 25
5-29            4.5 to less than 29.5    3 + 6 + 8 + 8 + 5 = 30

From Table 2.14, we can determine the number of observations that fall below the upper
limit or boundary of each class. For example, 19 or fewer iPods were sold on 17 days.

Constructing

a cumulative frequency

distribution table.


The cumulative relative frequencies are obtained by dividing the cumulative frequencies
by the total number of observations in the data set. The cumulative percentages are obtained
by multiplying the cumulative relative frequencies by 100.

Calculating Cumulative Relative Frequency and Cumulative Percentage

Cumulative relative frequency = (Cumulative frequency of a class)/(Total observations in the data set)

Cumulative percentage = (Cumulative relative frequency) × 100

Table 2.15 contains both the cumulative relative frequencies and the cumulative percent-
ages for Table 2.14. We can observe, for example, that 19 or fewer iPods were sold on 56.7%
of the days.

Table 2.15 Cumulative Relative Frequency and
Cumulative Percentage Distributions
for iPods Sold

Class Limits    Cumulative Relative Frequency    Cumulative Percentage
5-9             3/30 = .100                       10.0
5-14            9/30 = .300                       30.0
5-19            17/30 = .567                      56.7
5-24            25/30 = .833                      83.3
5-29            30/30 = 1.000                    100.0
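
The cumulative columns of Tables 2.14 and 2.15 are running totals of the class frequencies. As a minimal sketch in Python, they can be computed with `itertools.accumulate`. The variable names are illustrative, and the class frequencies 3, 6, 8, 8, and 5 are the ones that appear in the sums of Table 2.14.

```python
from itertools import accumulate

# Class frequencies for the classes 5-9, 10-14, 15-19, 20-24, 25-29.
freqs = [3, 6, 8, 8, 5]
upper_limits = [9, 14, 19, 24, 29]

cum_freqs = list(accumulate(freqs))       # running totals of the frequencies
n = cum_freqs[-1]                         # total observations in the data set

cum_rel = [c / n for c in cum_freqs]      # cumulative relative frequencies
cum_pct = [100 * r for r in cum_rel]      # cumulative percentages

for upper, c, r, p in zip(upper_limits, cum_freqs, cum_rel, cum_pct):
    print(f"5-{upper:<3} {c:>3}  {r:.3f}  {p:.1f}")
```

The last cumulative frequency always equals the total number of observations, so the last cumulative percentage is always 100.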

Ogives

When plotted on a diagram, the cumulative frequencies give a curve that is called an ogive
(pronounced o-jive). Figure 2.12 gives an ogive for the cumulative frequency distribution of
Table 2.14. To draw the ogive in Figure 2.12, the variable, which is total iPods sold, is marked
on the horizontal axis and the cumulative frequencies on the vertical axis. Then the dots are
marked above the upper boundaries of various classes at the heights equal to the corresponding
cumulative frequencies. The ogive is obtained by joining consecutive points with straight lines.
Note that the ogive starts at the lower boundary of the first class and ends at the upper bound-
ary of the last class.

[Ogive: iPods sold on the horizontal axis, marked at the class boundaries 4.5, 9.5, 14.5, 19.5,
24.5, and 29.5; cumulative frequency on the vertical axis.]

Figure 2.12 Ogive for the cumulative frequency
distribution of Table 2.14.

Definition

Ogive An ogive is a curve drawn for the cumulative frequency distribution by joining with
straight lines the dots marked above the upper boundaries of classes at heights equal to the cumu-
lative frequencies of respective classes.

One advantage of an ogive is that it can be used to approximate the cumulative frequency
for any interval. For example, we can use Figure 2.12 to find the number of days for which 17
or fewer iPods were sold. First, draw a vertical line from 17 on the horizontal axis up to the
ogive. Then draw a horizontal line from the point where this line intersects the ogive to the ver-
tical axis. This point gives the cumulative frequency of the class 5 to 17. In Figure 2.12, this
cumulative frequency is (approximately) 13 as shown by the dashed line. Therefore, 17 or fewer
iPods were sold on 13 days.
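
Because the ogive joins consecutive points with straight lines, reading a value from it is the same as linear interpolation between two neighboring points. The Python sketch below reproduces the reading for 17 iPods under that assumption. The function name `ogive_value` is hypothetical, and the ogive is taken to start at height zero at the lower boundary of the first class.

```python
def ogive_value(x, boundaries, cum_freqs):
    """Approximate the cumulative frequency at x by linear interpolation.

    boundaries: lower boundary of the first class, then every upper boundary.
    cum_freqs:  cumulative frequencies at the upper boundaries.
    """
    heights = [0] + list(cum_freqs)       # the ogive starts at height 0
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return float(heights[-1])
    points = list(zip(boundaries, heights))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            # Move along the straight segment joining (x0, y0) and (x1, y1).
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


boundaries = [4.5, 9.5, 14.5, 19.5, 24.5, 29.5]
cum_freqs = [3, 9, 17, 25, 30]

print(ogive_value(17, boundaries, cum_freqs))
```

For 17 on the horizontal axis, the segment from (14.5, 9) to (19.5, 17) gives 9 + (2.5/5)(8) = 13, which matches the value read from Figure 2.12.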

We can draw an ogive for cumulative relative frequency and cumulative percentage distri-
butions the same way as we did for the cumulative frequency distribution.

EXERCISES

CONCEPTS AND PROCEDURES

2.32 Briefly explain the concept of cumulative frequency distribution. How are the cumulative relative
frequencies and cumulative percentages calculated?

2.33 Explain for what kind of frequency distribution an ogive is drawn. Can you think of any use for an
ogive? Explain.

■ APPLICATIONS

2.34 The following table, reproduced from Exercise 2.14, gives the frequency distribution of the number
of credit cards possessed by 80 adults.

Number of Credit Cards    Number of Adults
0 to 3                    18
4 to 7                    26
8 to 11                   22
12 to 15                  11
16 to 19                   3

a. Prepare a cumulative frequency distribution.

b. Calculate the cumulative relative frequencies and cumulative percentages for all classes.

c. Find the percentage of these adults who possess 7 or fewer credit cards.

d. Draw an ogive for the cumulative percentage distribution.

e. Using the ogive, find the percentage of adults who possess 10 or fewer credit cards.

2.35 The following table, reproduced from Exercise 2.15, gives the frequency distribution of ages for all
50 employees of a company.

Age

Number of Employees

18 to 30

31 to 43

44 to 56

57 to 69

a. Prepare a cumulative frequency distribution table.

b. Calculate the cumulative relative frequencies and cumulative percentages for all classes.


c. What percentage of the employees of this company are 44 years of age or older?

d. Draw an ogive for the cumulative percentage distribution.

e. Using the ogive, find the percentage of employees who are age 40 or younger.

2.36 Using the frequency distribution table constructed in Exercise 2.18, prepare the cumulative fre-
quency, cumulative relative frequency, and cumulative percentage distributions.

2.37 Using the frequency distribution table constructed in Exercise 2.19, prepare the cumulative frequency,
cumulative relative frequency, and cumulative percentage distributions.

2.38 Using the frequency distribution table constructed in Exercise 2.20, prepare the cumulative frequency,
cumulative relative frequency, and cumulative percentage distributions.

2.39 Prepare the cumulative frequency, cumulative relative frequency, and cumulative percentage distributions
using the frequency distribution constructed in Exercise 2.23.

2.40 Using the frequency distribution table constructed for the data of Exercise 2.25, prepare the cumulative
frequency, cumulative relative frequency, and cumulative percentage distributions.

2.41 Refer to the frequency distribution table constructed in Exercise 2.26. Prepare the cumulative frequency,
cumulative relative frequency, and cumulative percentage distributions by using that table.

2.42 Using the frequency distribution table constructed for the data of Exercise 2.21, prepare the cumulative
frequency, cumulative relative frequency, and cumulative percentage distributions. Draw an ogive for the
cumulative frequency distribution. Using the ogive, find the (approximate) number of individuals who
made charitable contributions of $200 million or less.

2.43 Refer to the frequency distribution table constructed in Exercise 2.27. Prepare the cumulative frequency,
cumulative relative frequency, and cumulative percentage distributions. Draw an ogive for the cumulative
frequency distribution. Using the ogive, find the (approximate) number of teams with 20 or fewer offen-
sive points scored per game.

2.6 Stem-and-Leaf Displays

Another technique that is used to present quantitative data in condensed form is the stem-and-
leaf display. An advantage of a stem-and-leaf display over a frequency distribution is that by
preparing a stem-and-leaf display we do not lose information on individual observations. A
stem-and-leaf display is constructed only for quantitative data.

Definition

Stem-and-Leaf Display In a stem-and-leaf display of quantitative data, each value is divided into
two portions — a stem and a leaf. The leaves for each stem are shown separately in a display.

Example 2-8 describes the procedure for constructing a stem-and-leaf display.

Constructing a
stem-and-leaf display for
two-digit numbers.

■ EXAMPLE 2-8

The following are the scores of 30 college students on a statistics test.

Construct a stem-and-leaf display.

Solution To construct a stem-and-leaf display for these scores, we split each score into two
parts. The first part contains the first digit, which is called the stem. The second part contains
the second digit, which is called the leaf. Thus, for the score of the first student, which is 75,
7 is the stem and 5 is the leaf. For the score of the second student, which is 52, the stem is 5
and the leaf is 2. We observe from the data that the stems for all scores are 5, 6, 7, 8, and 9
because all the scores lie in the range 50 to 98. To create a stem-and-leaf display, we draw a
vertical line and write the stems on the left side of it, arranged in increasing order, as shown
in Figure 2.13.

Stems
5 | 2    <- Leaf for 52
6 |
7 | 5    <- Leaf for 75
8 |
9 |

Figure 2.13 Stem-and-leaf display.

After we have listed the stems, we read the leaves for all scores and record them next
to the corresponding stems on the right side of the vertical line. For example, for the first
score we write the leaf 5 next to the stem 7; for the second score we write the leaf 2 next
to the stem 5. The recording of these two scores in a stem-and-leaf display is shown in
Figure 2.13.

Now, we read all the scores and write the leaves on the right side of the vertical line in
the rows of corresponding stems. The complete stem-and-leaf display for scores is shown in
Figure 2.14.

5 | 2 0 7
6 | 5 9 1 8 4
7 | 5 9 1 2 6 9 7 1 2
8 | 0 7 1 6 3 4 7
9 | 6 3 5 2 2 8

Figure 2.14 Stem-and-leaf display of test scores.

By looking at the stem-and-leaf display of Figure 2.14, we can observe how the data values
are distributed. For example, the stem 7 has the highest frequency, followed by stems 8, 9, 6,
and 5.

The leaves for each stem of the stem-and-leaf display of Figure 2.14 are ranked (in increasing
order) and presented in Figure 2.15.

5 | 0 2 7
6 | 1 4 5 8 9
7 | 1 1 2 2 5 6 7 9 9
8 | 0 1 3 4 6 7 7
9 | 2 2 3 5 6 8

Figure 2.15 Ranked stem-and-leaf display of test scores.

As already mentioned, one advantage of a stem-and-leaf display is that we do not lose
information on individual observations. We can rewrite the individual scores of the 30 college
students from the stem-and-leaf display of Figure 2.14 or Figure 2.15. By contrast, the
information on individual observations is lost when data are grouped into a frequency table.

■ EXAMPLE 2-9

The following data give the monthly rents paid by a sample of 30 households selected from
a small town.

 880  1081   721  1075  1023   775  1235   750   965   960
1210   985  1231   932   850   825  1000   915  1191  1035
1151   630  1175   952  1100  1140   750  1140  1370  1280

Constructing a stem-and-leaf
display for three-and four-digit
numbers.

Construct a stem-and-leaf display for these data.


Solution Each of the values in the data set contains either three or four digits. We will take
the first digit for three-digit numbers and the first two digits for four-digit numbers as stems.
Then we will use the last two digits of each number as a leaf. Thus for the first value, which
is 880, the stem is 8 and the leaf is 80. The stems for the entire data set are 6, 7, 8, 9, 10, 11,
12, and 13. They are recorded on the left side of the vertical line in Figure 2.16. The leaves
for the numbers are recorded on the right side.

Figure 2.16 Stem-and-leaf
display of rents.
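
The splitting rule of Example 2-9 (the last two digits form the leaf and the remaining leading digits form the stem) can be sketched in Python with `divmod`. This is an illustrative sketch only: the function name `stem_and_leaf` is hypothetical, and the order in which the 30 rents are listed is an assumption, so the order of the leaves within a stem may differ from the order in Figure 2.16.

```python
from collections import defaultdict

# Monthly rents of the 30 households of Example 2-9.
rents = [
    880, 1081, 721, 1075, 1023, 775, 1235, 750, 965, 960,
    1210, 985, 1231, 932, 850, 825, 1000, 915, 1191, 1035,
    1151, 630, 1175, 952, 1100, 1140, 750, 1140, 1370, 1280,
]


def stem_and_leaf(values, leaf_digits=2):
    """Return {stem: [leaves]} using the last leaf_digits digits as the leaf."""
    base = 10 ** leaf_digits
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, base)   # e.g. 880 -> stem 8, leaf 80
        rows[stem].append(leaf)
    # List every stem from the smallest to the largest, even one without leaves.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


display = stem_and_leaf(rents)
for stem, leaves in display.items():
    # Pad each leaf to two digits so that a leaf of 0 prints as 00.
    print(f"{stem:>2} | " + " ".join(f"{leaf:02d}" for leaf in leaves))
```

Sorting each list of leaves before printing would give the ranked version of the display.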

Sometimes a data set may contain too many stems, with each stem containing only a few
leaves. In such cases, we may want to condense the stem-and-leaf display by grouping the stems.
Example 2-10 describes this procedure.

EXAMPLE 2-10

Preparing a grouped
stem-and-leaf display.

The following stem-and-leaf display is prepared for the number of hours that 25 students spent
working on computers during the past month.

0 | 6
1 | 1 7 9
2 | 2 6
3 | 2 4 7 8
4 | 1 5 6 9 9
5 | 3 6 8
6 | 2 4 4 5 7
7 |
8 | 5 6

Prepare a new stem-and-leaf display by grouping the stems.

Solution To condense the given stem-and-leaf display, we can combine the first three rows,
the middle three rows, and the last three rows, thus getting the stems 0-2, 3-5, and 6-8. The
leaves for each stem of a group are separated by an asterisk (*), as shown in Figure 2.17. Thus,
the leaf 6 in the first row corresponds to stem 0; the leaves 1, 7, and 9 correspond to stem 1;
and leaves 2 and 6 belong to stem 2.

0-2 | 6 * 1 7 9 * 2 6
3-5 | 2 4 7 8 * 1 5 6 9 9 * 3 6 8
6-8 | 2 4 4 5 7 * * 5 6

Figure 2.17 Grouped stem-and-leaf display.

If a stem does not contain a leaf, this is indicated in the grouped stem-and-leaf display by
two consecutive asterisks. For example, in the above stem-and-leaf display, there is no leaf for
7; that is, there is no number in the 70s. Hence, in Figure 2.17, we have two asterisks after
the leaves for 6 and before the leaves for 8.
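
The grouping procedure of Example 2-10 can be sketched in Python as follows. The function name `group_stems` is hypothetical, and the dictionary `display` holds the leaves of each stem as they can be read from Figure 2.17. An empty list stands for a stem without leaves, which produces the two consecutive asterisks.

```python
# Leaves of each stem for the hours spent working on computers (from Figure 2.17).
display = {
    0: [6], 1: [1, 7, 9], 2: [2, 6],
    3: [2, 4, 7, 8], 4: [1, 5, 6, 9, 9], 5: [3, 6, 8],
    6: [2, 4, 4, 5, 7], 7: [], 8: [5, 6],
}


def group_stems(display, size=3):
    """Combine every `size` consecutive stems into one row of the display."""
    stems = sorted(display)
    lines = []
    for i in range(0, len(stems), size):
        chunk = stems[i:i + size]
        tokens = []
        for j, stem in enumerate(chunk):
            if j > 0:
                tokens.append("*")        # asterisk between the leaves of two stems
            tokens.extend(str(leaf) for leaf in display[stem])
        lines.append(f"{chunk[0]}-{chunk[-1]} | " + " ".join(tokens))
    return lines


grouped = group_stems(display)
for line in grouped:
    print(line)
```

Because the asterisk is written between stems whether or not a stem has leaves, the row for stems 6-8 shows two asterisks in a row for the empty stem 7.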


EXERCISES

CONCEPTS AND PROCEDURES

2.44 Briefly explain how to prepare a stem-and-leaf display for a data set. You may use an example to
illustrate.

2.45 What advantage does preparing a stem-and-leaf display have over grouping a data set using a fre-
quency distribution? Give one example.

2.46 Consider this stem-and-leaf display.

3 6

14 5

34677789
2 2 3 5 6 6 9
7 8 9

Write the data set that is represented by the display.

2.47 Consider this stem-and-leaf display.

2-3 | 18 45 56 * 29 67 83 97
4-5 | 04 27 33 71 * 23 37 51 63 81 92
6-8 | 22 36 47 55 78 89 * * 10 41

Write the data set that is represented by the display.

■ APPLICATIONS

2.48 The following data give the time (in minutes) that each of 20 students waited in line at their
bookstore to pay for their textbooks at the beginning of the Spring 2009 semester. (Note: To prepare a
stem-and-leaf display, each number in this data set can be written as a two-digit number. For example,
8 can be written as 08, for which the stem is 0 and the leaf is 8.)

15 8 23 21 5 17 31 22 34 6

5 10 14 17 16 25 30 3 31 19

Construct a stem-and-leaf display for these data. Arrange the leaves for each stem in increasing order.

2.49 Following are the total yards gained rushing during the 2009 season by 14 running backs of 14 college
football teams.

745 921 1133 1024 848 775 800

1009 1275 857 933 1145 967 995

Prepare a stem-and-leaf display. Arrange the leaves for each stem in increasing order.

2.50 Reconsider the data on the numbers of computer monitors produced at the Nixon Corporation for a
sample of 30 days given in Exercise 2.19. Prepare a stem-and-leaf display for those data. Arrange the
leaves for each stem in increasing order.

2.51 Reconsider the data on the numbers of computer keyboards assembled at the Twentieth Century Elec-
tronics Company given in Exercise 2.20. Prepare a stem-and-leaf display for those data. Arrange the leaves
for each stem in increasing order.

2.52 Refer to Exercise 2.18. Rewrite those data by rounding each median household income to the near-
est thousand. For example, a median household income of $43,260 will be rounded to 43 thousand, and
$50,689 will be rounded to 51 thousand. Prepare a stem-and-leaf display for these data. Arrange the leaves
for each stem in increasing order.

2.53 These data give the times (in minutes) taken to commute from home to work for 20 workers.

10 50 65 33 48 5 11 23 39 26

26 32 17 7 15 19 29 43 21 22

Construct a stem-and-leaf display for these data. Arrange the leaves for each stem in increasing order.

2.54 The following data give the times served (in months) by 35 prison inmates who were released recently.

a. Prepare a stem-and-leaf display for these data.

b. Condense the stem-and-leaf display by grouping the stems as 0-2, 3-5, and 6-9.

2.55 The following data give the money (in dollars) spent on textbooks by 35 students during the 2009-10
academic year.

565 728 470 620 345 368 610
765 550 845 530 705 490 258
320 505 457 787 617 721 635
438 575 702 538 720 460 540
890 560 570 706 430 268 638

a. Prepare a stem-and-leaf display for these data using the last two digits as leaves.

b. Condense the stem-and-leaf display by grouping the stems as 2-4, 5-6, and 7-8.

2.7 Dotplots

One of the simplest methods for graphing and understanding quantitative data is to create a dot-
plot. As with most graphs, statistical software should be used to make a dotplot for large data
sets. However, Example 2-11 demonstrates how to create a dotplot by hand.

Dotplots can help us detect outliers (also called extreme values) in a data set. Outliers are
the values that are extremely large or extremely small with respect to the rest of the data values.

Definition

Outliers or Extreme Values Values that are very small or very large relative to the majority of
the values in a data set are called outliers or extreme values.

■ EXAMPLE 2-11

Table 2.16 lists the lengths of the longest field goals (in yards) made by all kickers in the
American Football Conference (AFC) of the National Football League (NFL) during the
2008 season. Create a dotplot for these data.

Table 2.16 Distances of Longest Field Goals (in Yards) Made by AFC Kickers During the
2008 NFL Season

Name             Team             Distance
S. Hauschka      Baltimore
M. Stover        Baltimore
R. Lindell       Buffalo
S. Graham        Cincinnati
D. Rayner        Cincinnati
P. Dawson        Cleveland
M. Prater        Denver
K. Brown         Houston
A. Vinatieri     Indianapolis
J. Scobee        Jacksonville
C. Barth         Kansas City
N. Novak         Kansas City
D. Carpenter     Miami
S. Gostkowski    New England
J. Feely         New York Jets
S. Janikowski    Oakland
J. Reed          Pittsburgh
N. Kaeding       San Diego
R. Bironas       Tennessee

Source: ESPN.com.

Solution Below we show how to make a dotplot for these data on field goal distances.

Step 1. The minimum and maximum values in this data set are 26 and 57 yards, respec-
tively. First, we draw a horizontal line (let us call this the numbers line) with numbers that
cover the given data as shown in Figure 2.18. Note that the numbers line in Figure 2.18
shows the values from 25 to 57.

Figure 2.18 Numbers line (marked at the odd values from 25 to 57).

Step 2. Place a dot above the value on the numbers line that represents each distance listed
in the table. For example, S. Hauschka's longest successful field goal of the 2008 season was
54 yards. Place a dot above 54 on the numbers line as shown in Figure 2.19. If there are two
or more observations with the same value, we stack dots vertically above each other to rep-
resent those values. For example, as shown in Table 2.16, 53 yards was the distance of the
longest field goals made by four players. We stack four dots (one for each player) above 53
on the numbers line as shown in Figure 2.19. Figure 2.19 gives the complete dotplot.

Figure 2.19 Dotplot for Longest Completed Field Goal.

As we examine the dotplot of Figure 2.19, we notice that there are two clusters (groups) of data.
Approximately 70% of the kickers made field goals of 50 to 57 yards. All but one of the re-
maining kickers completed long field goals of 43 to 47 yards. In addition, one kicker, D. Rayner
of the Cincinnati Bengals, had a longest field goal of 26 yards. When this occurs, we expect that
such a data value could be an outlier. (In the box-and-whisker section of Chapter 3, we will learn
a numerical method to determine whether a data point should be classified as an outlier.)
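The two steps of Example 2-11 can be sketched as a small character-based dotplot in Python. This is an illustrative sketch, not the statistical software the text refers to: the function name `text_dotplot` is hypothetical, and the distances below are made-up values chosen only to show stacking, clusters, and an isolated low value.

```python
from collections import Counter


def text_dotplot(values, low=None, high=None, dot="o"):
    """Return the lines of a character dotplot, top row first."""
    counts = Counter(values)
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    axis = range(low, high + 1)
    width = len(str(high)) + 1          # column width for each axis value
    lines = []
    # Step 2: stack one dot per observation above its value.
    for level in range(max(counts.values()), 0, -1):
        row = "".join((dot if counts[v] >= level else " ").rjust(width)
                      for v in axis)
        lines.append(row.rstrip())
    # Step 1: the numbers line covering the data.
    lines.append("".join(str(v).rjust(width) for v in axis))
    return lines


# Hypothetical distances in yards (not the values of Table 2.16).
distances = [26, 43, 45, 47, 50, 51, 51, 52, 53, 53, 53, 53, 54, 56, 57]
print("\n".join(text_dotplot(distances, low=25, high=57)))
```

Every integer from `low` to `high` gets its own column, so gaps between clusters stay visible in the output.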

Dotplots are also very useful for comparing two or more data sets. To do so, we create a
dotplot for each data set with numbers lines for all data sets on the same scale. We place these
data sets on top of each other, resulting in what are called stacked dotplots. Example 2-12
shows this procedure.

■ EXAMPLE 2-12

Refer to Table 2.16 in Example 2-11, which gives the distances of longest completed field
goals for all kickers in the AFC during the 2008 NFL season. Table 2.17 provides the same
information for the kickers in the National Football Conference (NFC) of the NFL for the 2008
season. Make dotplots for both sets of data and compare these two dotplots.

Table 2.17 Distances of Longest Field Goals (in Yards) Made by NFC Kickers During the
2008 NFL Season

Name             Team               Distance
N. Rackers       Arizona
J. Elam          Atlanta
J. Kasay         Carolina
R. Gould         Chicago
N. Folk          Dallas
J. Hanson        Detroit
M. Crosby        Green Bay
R. Longwell      Minnesota
M. Gramatica     New Orleans
G. Hartley       New Orleans
T. Mehlhaff      New Orleans
J. Carney        New York Giants
L. Tynes         New York Giants
D. Akers         Philadelphia
J. Nedney        San Francisco
O. Mare          Seattle
J. Brown         St. Louis
M. Bryant        Tampa Bay
S. Suisham       Washington

Source: ESPN.com.

Solution Figure 2.20 shows the dotplots for the field goal distances for the kickers in both
NFL conferences.

Figure 2.20 Stacked dotplot of longest field goals made by kickers in the AFC and NFC during
the 2008 NFL season (panels: AFC kickers, NFC kickers; common numbers line from 19 to 57).

Looking at the stacked dotplot, we see that the majority of the distances fall within a range of
7 yards. In the AFC, the range is 50 to 57 yards, whereas in the NFC, the range is 47 to 54
yards. Both conferences have one outlier, with the shortest distance being 19 yards.
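The idea of placing several dotplots on one common scale can be sketched as follows. The function name `stacked_dotplots` and the two samples are hypothetical and serve only to show the layout; they are not the AFC and NFC data.

```python
from collections import Counter


def stacked_dotplots(groups, dot="o"):
    """Render one character dotplot per group on a shared numbers line.

    `groups` maps a label to a list of integer observations.
    """
    low = min(min(vals) for vals in groups.values())
    high = max(max(vals) for vals in groups.values())
    axis = range(low, high + 1)
    width = len(str(high)) + 1
    numbers_line = "".join(str(v).rjust(width) for v in axis)
    lines = []
    for label, vals in groups.items():
        counts = Counter(vals)
        for level in range(max(counts.values()), 0, -1):
            row = "".join((dot if counts[v] >= level else " ").rjust(width)
                          for v in axis)
            lines.append(row.rstrip())
        lines.append(numbers_line)      # same scale under every group
        lines.append(label)
        lines.append("")
    return lines


# Hypothetical samples, used only to show the layout.
samples = {
    "Group A": [26, 45, 50, 52, 53, 53, 54, 57],
    "Group B": [19, 43, 47, 48, 50, 50, 51, 54],
}
print("\n".join(stacked_dotplots(samples)))
```

Computing the axis limits from all groups together is what makes the plots comparable: a value sits in the same column in every panel.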

In practice, dotplots and other statistical graphs will be created using statistical software.
The Technology Instruction section at the end of this chapter shows how we can do so.

EXERCISES

CONCEPTS AND PROCEDURES

2.56 Briefly explain how to prepare a dotplot for a data set. You may use an example to illustrate.

2.57 What is a stacked dotplot, and how is it used? Explain.

2.58 Create a dotplot for the following data set.

■ APPLICATIONS

2.59 Reconsider the data on the numbers of computer keyboards assembled at the Twentieth Century Elec-
tronics Company given in Exercise 2.20. Create a dotplot for those data.

2.60 Create a dotplot for the data on the number of turnovers (fumbles and interceptions) by a college
football team for games in the past two seasons given in Exercise 2.28.

2.61 Reconsider the data on the numbers of errors found in 25 randomly selected credit reports given in
Exercise 2.29. Create a dotplot for those data.

2.62 The following data give the number of times each of the 30 randomly selected account holders at a
bank used that bank's ATM during a 60-day period.

Create a dotplot for these data and point out any clusters or outliers.

2.63 The following data give the number of times each of the 20 randomly selected male students from
a state university ate at fast-food restaurants during a 7-day period.

5 8 10 3 5 5 10 7 2 1

10 4 5 0 10 1 2 8 3 5

Create a dotplot for these data and point out any clusters or outliers.

2.64 Reconsider Exercise 2.63. The following data give the number of times each of the 20 randomly
selected female students from the same state university ate at fast-food restaurants during the same 7-day
period.

0424 10 2505

6 114624560

a. Create a dotplot for these data.

b. Use the dotplots for male and female students to compare the two data sets.

2.65 The following table gives the number of times each of the listed players of the 2008 Philadelphia
Phillies baseball team was hit by a pitch (HBP). The list includes all players with at least 150 at-bats.

Player          HBP
C. Utley
R. Howard
C. Coste
P. Burrell
S. Victorino
G. Dobbs
J. Rollins
G. Jenkins
C. Ruiz
P. Feliz
J. Werth
T. Iguchi
E. Bruntlett

Source: Major League Baseball, 2008.

Create a dotplot for these data. Mention any clusters and/or outliers you observe.

USES AND MISUSES... TRUNCATING THE AXES

Graphical analyses are an important part of statistics. However, graph-
ics software allows people to change the appearance of any graph.
Thus, one needs to be careful when reading a graph in order to in-
terpret the data properly. As an example, consider the following data
on the number of identity theft cases per 1000 people in eight states
during the year 2005.

State         Identity Theft Cases per 1000 People
Arizona       156.9
Nevada        130.2
California    125.0
Texas         116.5
Colorado       97.2
Florida        95.8
Washington     92.4
New York       90.3

A proper bar graph of these data would look like the one shown in
Figure 2.21.

Figure 2.21 Identity theft cases for eight states (bar chart titled "Chart of Identity Theft
Cases vs State", with the vertical axis starting at zero).

Source: Identity Theft Security, 2006.

In addition to being able to obtain approximate rates by look-
ing at the vertical axis, we can determine the relative sizes of the
rates by looking at the bars. For example, we can see that the rates
in Colorado and Florida are approximately the same, whereas the
rate in Arizona is approximately 70% higher than the rates in Wash-
ington and New York.

We can present the same data using a bar graph similar to the
one shown in Figure 2.22. The data have not been altered in this
graph. However, the vertical axis has been changed so that it begins
at 80 instead of at zero. As a result, many people will look at this
graph and see very different relationships among the states than the
relationships that actually exist. For example, the bar representing
Arizona appears to be eight times the height of the bar for New York.
This would correspond to an identity theft rate that is 700% higher
for Arizona than for New York when it is actually only about 74% higher.
Similarly, the rate for Texas appears to be 300% higher than for
New York when it is really only about 29% higher.
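The distortion can be checked with a short calculation. The sketch below, with the hypothetical helper `height_ratio`, compares the ratio of two rates with the ratio of the drawn bar heights when the vertical axis starts at 80 instead of zero; the rates are taken from the table above.

```python
# Rates from the table of identity theft cases.
rates = {"Arizona": 156.9, "Nevada": 130.2, "California": 125.0,
         "Texas": 116.5, "Colorado": 97.2, "Florida": 95.8,
         "Washington": 92.4, "New York": 90.3}


def height_ratio(state, reference, axis_start=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at axis_start."""
    return (rates[state] - axis_start) / (rates[reference] - axis_start)


for state in ("Arizona", "Texas"):
    true_ratio = height_ratio(state, "New York")           # axis from zero
    drawn_ratio = height_ratio(state, "New York", 80.0)    # truncated axis
    print(f"{state}: ratio of rates {true_ratio:.2f}, "
          f"ratio of bar heights with axis from 80: {drawn_ratio:.2f}")
```

With the axis starting at zero the two ratios coincide, which is why the untruncated bar graph lets a reader compare rates by comparing bars.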

Figure 2.22 Identity theft cases for eight states.

There are many reasons why the second version of the bar chart
might be used. In some circumstances, the creator of the graph may
not know any better. In other cases, it could be that the user has
ulterior motives, such as to justify increasing funding for fighting iden-
tity theft or to make a state or a region appear to have a lower
rate than what it truly is.

Problematic graphs also occur because the user does not under-
stand the purpose of a specific type of graph. For example, suppose
that the above data on identity thefts were displayed in a pie chart
such as the one shown here.

Pie Chart of Identity Theft Cases vs State

There are many possible misinterpretations of the data as a
result of using this pie chart. For example, a person might look at
the graph and believe that 17.4% of all identity theft cases occur
in Arizona. There are two major problems with this interpretation.
First, only eight states are represented in this chart, so all of the
percentages are overstated with respect to the entire United States.
Second, and more important, the percentages are based on the
sums of the state rates per 1000 people. This cannot be done un-
less all of the states have the same population.
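The percentage quoted for Arizona can be reproduced by treating each state's rate as a slice of the sum of the eight rates, which is exactly the questionable step the text describes. The following sketch uses the rates from the table; the variable names are illustrative.

```python
# Rates from the table of identity theft cases.
rates = {"Arizona": 156.9, "Nevada": 130.2, "California": 125.0,
         "Texas": 116.5, "Colorado": 97.2, "Florida": 95.8,
         "Washington": 92.4, "New York": 90.3}

# A pie chart divides each value by the total of all values shown.
total = sum(rates.values())
shares = {state: 100 * rate / total for state, rate in rates.items()}

for state, share in shares.items():
    print(f"{state}: {share:.1f}% of the pie")
```

The slices add up to 100% only because the chart is limited to these eight states, and the total being divided is a sum of rates rather than a count of cases.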

California and Arizona can be used to illustrate the latter point.
If you look at the pie chart, it would appear that the number of
identity theft cases in Arizona is higher than the number of cases
in California. However, the rate of identity theft is affected by the
population size. California had 45,175 cases, while Arizona had
9320. The 2005 estimated populations of the two states were
36.1 million in California and 5.9 million in Arizona. California had
slightly less than five times as many cases as Arizona, but more than
six times as many people, which is why Arizona's rate is higher than
California's rate.
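The comparison in this paragraph amounts to dividing one ratio by another. A minimal sketch using the case counts and population estimates quoted above (variable names are illustrative):

```python
cases = {"California": 45_175, "Arizona": 9_320}
population_millions = {"California": 36.1, "Arizona": 5.9}

# How many times as many cases, and as many people, California has.
case_ratio = cases["California"] / cases["Arizona"]
population_ratio = (population_millions["California"]
                    / population_millions["Arizona"])

# A value below 1 means California's rate is lower than Arizona's.
rate_ratio = case_ratio / population_ratio

print(f"cases: {case_ratio:.2f} times, people: {population_ratio:.2f} times, "
      f"rate: {rate_ratio:.2f} times")
```

Because the population ratio exceeds the case ratio, the quotient falls below 1, which is the numerical form of the statement that Arizona's rate is higher.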

As you are learning about different types of graphs in statis-
tics, it is important to know how to make them and how to inter-
pret them; otherwise you can be misled or deceived by them.

Glossary

Bar graph A graph made of bars whose heights represent the fre-
quencies of respective categories.

Class An interval that includes all the values in a (quantitative)
data set that fall within two numbers, the lower and upper limits of
the class.

Class boundary The midpoint of the upper limit of one class and
the lower limit of the next class.

Class frequency The number of values in a data set that belong
to a certain class.

Class midpoint or mark The class midpoint or mark is obtained
by dividing the sum of the lower and upper limits (or boundaries)
of a class by 2.

Class width or size The difference between the two boundaries of
a class.

Cumulative frequency The frequency of a class that includes
all values in a data set that fall below the upper boundary of that
class.

Cumulative frequency distribution A table that lists the total
number of values that fall below the upper boundary of each class.

Cumulative percentage The cumulative relative frequency multi-
plied by 100.

Cumulative relative frequency The cumulative frequency of a
class divided by the total number of observations.

Frequency distribution A table that lists all the categories or
classes and the number of values that belong to each of these cate-
gories or classes.

Grouped data A data set presented in the form of a frequency
distribution.

Histogram A graph in which classes are marked on the horizon-
tal axis and frequencies, relative frequencies, or percentages are
marked on the vertical axis. The frequencies, relative frequencies, or
percentages of various classes are represented by bars that are drawn
adjacent to each other.

Ogive A curve drawn for a cumulative frequency distribution.

Outliers or Extreme values Values that are very small or very
large relative to the majority of the values in a data set.

Percentage The percentage for a class or category is obtained by
multiplying the relative frequency of that class or category by 100.

Pie chart A circle divided into portions that represent the relative fre-
quencies or percentages of different categories or classes.

Polygon A graph formed by joining the midpoints of the tops of
successive bars in a histogram by straight lines.

Raw data Data recorded in the sequence in which they are col-
lected and before they are processed.

Relative frequency The frequency of a class or category divided
by the sum of all frequencies.

Skewed-to-the-left histogram A histogram with a longer tail on
the left side.

Skewed-to-the-right histogram A histogram with a longer tail on
the right side.

Stem-and-leaf display A display of data in which each value is
divided into two portions — a stem and a leaf.

Symmetric histogram A histogram that is identical on both sides
of its central point.

Ungrouped data Data containing information on each member of
a sample or population individually.

Uniform or rectangular histogram A histogram with the same
frequency for all classes.
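Several of the glossary terms (class frequency, relative frequency, percentage, cumulative frequency) can be illustrated together for single-valued classes. The sketch below is one possible arrangement; the function name `frequency_table` and the observations are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows (class, f, relative frequency, percentage, cumulative f)
    for single-valued classes, in increasing order of the class value."""
    counts = Counter(values)
    classes = sorted(counts)
    n = len(values)
    cumulative = accumulate(counts[c] for c in classes)
    return [(c, counts[c], counts[c] / n, 100 * counts[c] / n, cum)
            for c, cum in zip(classes, cumulative)]


# Hypothetical observations.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]
for value, f, rel, pct, cum in frequency_table(data):
    print(f"{value}: f={f}, relative={rel:.2f}, percent={pct:.0f}, cumulative={cum}")
```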

Supplementary Exercises

2.66 The following data give the political party of each of the first 30 U.S. presidents. In the data, D stands
for Democrat, DR for Democratic Republican, F for Federalist, R for Republican, and W for Whig.

a. Prepare a frequency distribution table for these data.

b. Calculate the relative frequency and percentage distributions.

c. Draw a bar graph for the relative frequency distribution and a pie chart for the percentage
distribution.

d. What percentage of these presidents were Whigs?

2.67 In a November 2008 Harris Poll, U.S. adults were asked "Will the Obama administration be too lib-
eral or conservative?" Of the respondents, 35% said that it will be too liberal (L), 43% said that it will be
neither too liberal nor too conservative (N), 4% said that it will be too conservative (C), and 18% said that
they do not know (K). In a recent poll, 40 people were asked whether the Obama administration has been
too liberal or too conservative. Their responses are given below.

a. Prepare a frequency distribution for these data.

b. Calculate the relative frequencies and percentages for all classes.

c. Draw a bar graph for the frequency distribution and a pie chart for the percentage distribution.

d. What percentage of these respondents said "too liberal"?

2.68 The following data give the numbers of television sets owned by 40 randomly selected households.

1 1 2 3 2 4 1 3 2 1

3 0 2 1 2 3 2 3 2 2
1 2 1 1 1 3 1 1 1 2
2 4 2 3 1 3 1 2 2 4

a. Prepare a frequency distribution table for these data using single-valued classes.

b. Compute the relative frequency and percentage distributions.

c. Draw a bar graph for the frequency distribution.

d. What percentage of the households own two or more television sets?

2.69 Twenty-four students from universities in Connecticut were asked to name the five current members
of the U.S. House of Representatives from Connecticut. The number of correct names supplied by the stu-
dents are given below.

4 2 3 5 5 4 3 1 5 4 4 3
5 3 2 3 1 3 2 5 2 1 5 0

a. Prepare a frequency distribution for these data using single-valued classes.

b. Compute the relative frequency and percentage distributions.

c. What percentage of the students in this sample named fewer than two of the representatives
correctly?

d. Draw a bar graph for the relative frequency distribution.

2.70 The following data give the amounts spent on video rentals (in dollars) during 2009 by 30 house-
holds randomly selected from those who rented videos in 2009.

595 100 622 405 155 760
405 205 180 808 100 240
127 310 350 160 111

a. Construct a frequency distribution table. Take $1 as the lower limit of the first class and $200
as the width of each class.

b. Calculate the relative frequencies and percentages for all classes.

c. What percentage of the households in this sample spent more than $400 on video rentals
in 2009?

2.71 The following data give the numbers of orders received for a sample of 30 hours at the Timesaver
Mail Order Company.

47 38

53 57

46 36

a. Construct a frequency distribution table. Take 23 as the lower limit of the first class and 7 as
the width of each class.

b. Calculate the relative frequencies and percentages for all classes.

c. For what percentage of the hours in this sample was the number of orders more than 36?

2.72 The following data give the amounts (in dollars) spent on refreshments by 30 spectators randomly
selected from those who patronized the concession stands at a recent Major League Baseball game.

 4.95  27.99   8.00   5.80   4.50   2.99
 4.85   6.00   9.00  15.75   9.50   3.05
 5.65  21.00  16.60  18.00  21.77  12.35
 7.75  10.45   3.85  28.45   8.35  17.70
19.50  11.65  11.45   3.00   6.55  16.50

a. Construct a frequency distribution table using the less-than method to write classes. Take $0 as
the lower boundary of the first class and $6 as the width of each class.

b. Calculate the relative frequencies and percentages for all classes.

c. Draw a histogram for the frequency distribution.

2.73 The following data give the repair costs (in dollars) for 30 cars randomly selected from a list of cars
that were involved in collisions.

2300  750 2500  410  555 1576
2460 1795 2108  897  989 1866
2105  335 1344 1159 1236 1395
6108 4995 5891 2309 3950 6655
4900 1320 2901 1925 6896

a. Construct a frequency distribution table. Take $1 as the lower limit of the first class and $1400
as the width of each class.

b. Compute the relative frequencies and percentages for all classes.

c. Draw a histogram and a polygon for the relative frequency distribution.

d. What are the class boundaries and the width of the fourth class?

2.74 Refer to Exercise 2.70. Prepare the cumulative frequency, cumulative relative frequency, and
cumulative percentage distributions by using the frequency distribution table of that exercise.

2.75 Refer to Exercise 2.71. Prepare the cumulative frequency, cumulative relative frequency, and cumulative
percentage distributions using the frequency distribution table constructed for the data of that exercise.

2.76 Refer to Exercise 2.72. Prepare the cumulative frequency, cumulative relative frequency, and
cumulative percentage distributions using the frequency distribution table constructed for the data of
that exercise.

2.77 Construct the cumulative frequency, cumulative relative frequency, and cumulative percentage dis-
tributions by using the frequency distribution table constructed for the data of Exercise 2.73.

2.78 Refer to Exercise 2.70. Prepare a stem-and-leaf display for the data of that exercise.

2.79 Construct a stem-and-leaf display for the data given in Exercise 2.71.

2.80 The following table gives the 2008 endowments (in billions of dollars) for the six U.S. universities
that had the largest endowments.

University                               Endowment (billions of dollars)
Harvard University                       36.6
Yale University                          22.9
Stanford University                      17.2
Princeton University                     16.3
University of Texas System               16.1
Massachusetts Institute of Technology    10.1

Source: National Association of College and University Business Officers, 2008.

Draw two bar graphs for these data — the first without truncating the axis on which endowments are marked,
and the second by truncating this axis. In the second graph, mark the endowments on the vertical axis
starting with $9 billion. Briefly comment on the two bar graphs.

2.81 The following table lists the average price per gallon for unleaded regular gasoline in the United
States from 1999 to 2008.

Year    Average Price per Gallon (dollars)
1999    1.136
2000    1.484
2001    1.420
2002    1.345
2003    1.561
2004    1.852
2005    2.270
2006    2.572
2007    2.796
2008    3.246

Source: Energy Information Administration, 2008.

Draw two bar graphs for these data — the first without truncating the axis on which price is marked, and
the second by truncating this axis. In the second graph, mark the prices on the vertical axis starting with
$1.00. Briefly comment on the two bar graphs.

2.82 Reconsider the data on the times (in minutes) taken to commute from home to work for 20 workers
given in Exercise 2.53. Create a dotplot for those data.

2.83 Reconsider the data on the numbers of orders received for a sample of 30 hours at the Timesaver
Mail Order Company given in Exercise 2.71. Create a dotplot for those data.

2.84 Twenty-four students from a university in Oregon were asked to name the five current members of
the U.S. House of Representatives from their state. The following data give the numbers of correct names
given by these students.

5 5 1 2 4 5 3 1 5 5 0 1
2 3 5 4 3 1 5 2 5 4 5 3

Create a dotplot for these data.

2.85 The following data give the numbers of visitors during visiting hours on a given evening for each of
the 20 randomly selected patients at a hospital.

3 1 4 2 4 1 1 3

4202221 130

Create a dotplot for these data.

Advanced Exercises

2.86 The following frequency distribution table gives the age distribution of drivers who were at fault in
auto accidents that occurred during a 1-week period in a city.

Age (years)            f
18 to less than 20     7
20 to less than 25    12
25 to less than 30    18
30 to less than 40    14
40 to less than 50    15
50 to less than 60    16
60 and over           35

a. Draw a relative frequency histogram for this table.

b. In what way(s) is this histogram misleading?

c. How can you change the frequency distribution so that the resulting histogram gives a clearer
picture?

2.87 Refer to the data presented in Exercise 2.86. Note that there were 50% more accidents in the 25 to
less than 30 age group than in the 20 to less than 25 age group. Does this suggest that the older group of
drivers in this city is more accident-prone than the younger group? What other explanation might account
for the difference in accident rates?

2.88 Suppose a data set contains the ages of 135 autoworkers ranging from 20 to 53 years.

a. Using Sturges' formula given in footnote 1 on page 37, find an appropriate number of classes
for a frequency distribution for this data set.

b. Find an appropriate class width based on the number of classes in part a.

2.89 Statisticians often need to know the shape of a population to make inferences. Suppose that you are
asked to specify the shape of the population of weights of all college students.

a. Sketch a graph of what you think the weights of all college students would look like.

b. The following data give the weights (in pounds) of a random sample of 44 college students
(F and M indicate female and male, respectively).

123 F  195 M  138 M  115 F  179 M  119 F  148 F  147 F  180 M  146 F  179 M
189 M  175 M  108 F  193 M  114 F  179 M  147 M  108 F  128 F  164 F  174 M
128 F  159 M  193 M  204 M  125 F  133 F  115 F  168 M  123 F  183 M  116 F
182 M  174 M  102 F  123 F   99 F  161 M  162 M  155 F  202 M  110 F  132 M

i. Construct a stem-and-leaf display for these data.

ii. Can you explain why these data appear the way they do?

c. Now sketch a new graph of what you think the weights of all college students look like. Is this
similar to your sketch in part a?

2.90 Consider the two histograms given in Figure 2.23, which are drawn for the same data set. In this
data set, none of the values are integers.

Figure 2.23 Two histograms for the same data.

a. What are the endpoints and widths of classes in each of the two histograms?

b. In the first histogram, of the observations that fall in the interval that is centered at 8, how
many are actually between the left endpoint of that interval and 8? Note that you have to
consider both histograms to answer this question.

c. Observe the leftmost bars in both histograms. Why is the leftmost bar in the first histogram
misleading?

2.91 Refer to the data on weights of 44 college students given in Exercise 2.89. Create a dotplot of all 44
weights. Then create stacked dotplots for the weights of male and female students. Describe the similar-
ities and differences in the distributions of weights of male and female students. Using all three dotplots,
explain why you cannot distinguish the lightest males from the heaviest females when you consider only
the dotplot of all 44 weights.

2.92 The pie chart in Figure 2.24 shows the percentage distribution of ages (i.e., the percentages of
all prostate cancer patients falling in various age groups) for men who were recently diagnosed with
prostate cancer.

a. Are more or fewer than 50% of these patients in their 50s? How can you tell?

b. Are more or fewer than 75% of these patients in their 50s and 60s? How can you tell?

Figure 2.24 Pie chart of age groups.

c. A reporter looks at this pie chart and says, "Look at all these 50-year-old men who are getting
prostate cancer. This is a major concern for a man once he turns 50." Explain why the reporter
cannot necessarily conclude from this pie chart that there are a lot of 50-year-old men with
prostate cancer. Can you think of any other way to present these cancer cases (both graph and
variable) to determine if the reporter's claim is valid?

2.93 Stem-and-leaf displays can be used to compare distributions for two groups using a back-to-back
stem-and-leaf display. In such a display, one group is shown on the left side of the stems, and the other
group is shown on the right side. When the leaves are ordered, the leaves increase as one moves away
from the stems. The following stem-and-leaf display shows the money earned per tournament entered for
the top 30 money winners in the 2008-09 Professional Bowlers Association men's tour and for the top 21
money winners in the 2008-09 Professional Bowlers Association women's tour.

Women's

Men's

8871

65544330

334456899

840

03344678

011237888

The leaf unit for this display is 100. In other words, the data used represent the earnings in hundreds of
dollars. For example, for women's tour, the first number is 08, which is actually 800. The second number
is 11, which actually is 1100.

a. Do the top money winners, as a group, on one tour (men's or women's) tend to make more
money per tournament played than on the other tour? Explain how you can come to this
conclusion using the stem-and-leaf display.

b. What would be a typical earnings level amount per tournament played for each of the two tours?

c. Do the data appear to have similar spreads for the two tours? Explain how you can come to this
conclusion using the stem-and-leaf display.

d. Does either of the tours appear to have any outliers? If so, what are the earnings levels for
these players?

2.94 The following table lists the earnings per event that were referred to in Exercise 2.93. Although the
table lists earnings per event, players are listed in order of their total earnings, not their earnings per event.
Note that men and women are ranked together in the table.

Name

Earnings per Event
(in dollars)

Name

Earnings per Event
(in dollars)

Txlr\i"m Tlnlrp

7JUO.U /

p lifTPtiP Mi^i^nnp
JH/UgCllC 1V1UV_,U.11C

94.75 88

Wpc A/Talntt

wt^a ±\±aL\Jii

879S 61

Rnnnip Rii*;qp11

2540 63

Pot n r*V A 1 1 pst-i

6Q7Q 41

1? i tr*tii p A 1 lpn

??4n nn

irifTC R artiPQ

S970 00

TapV Tnrpk"

J lILK ,1 Lll L IS.

2322.94

W/nlfpv Pqu \A/illinmc It*
vv ciiLCi JVdy vv liiidiiiA j i.

47S8 87

T 17 \r\rt ti cr^Ti

7snn nn

Rill O'Neill

4884 18

IVTipIipIIp PplHman

IvlldlL 11L 1 11.11 1 ltll 1

5214.29

T?nitir\ P'l fr p
iviiliiu rage

487? SO

i ni"r\1\/n 1 ~1 Mnn_ R nil qtH
v^aHJlyll -L/tJllU JJOllctlU

51 85 71

John Nolen

4801.56

Stclanie Nation

4542.86

IVllJvC OL1 UgglllS

tju / .UU

Tpfl T1 1 TPl" PptVIpL"

J C111111CI .F CL1 ll<Iv

4285.71

Rrnrl An<rp1r\
JJld.Ll rtllgCHJ

429 1.18

Tnni AA/npccnpi"
J UU1 VV UCoallCl

1885 71

Rpfp W/p* r^^T"
r CLC VV C UCL

4HS 90

v n cinTir\n Pin nfm/clri/

Parlrpr Rnhn TTT

I til KCI JDUIlll 111

4101.47

lV^iccv Rpllinrlpi"
iviisay .dciiiiiu.ci

7^86 75

IVllL-lldCl ragall

^851 76

Tiifinrli"ci A cnihr
IvldllUl a rt.SUd.ly

954? 86

S tp vp T-f n nn an

JIL VL llUllllilll

4035 63

Tncha RpiH

111, Mill IvtlU

?4nn nn

Z.'-rVJVJ . \J\J

TnmiTu; Tnnpc
1U111111 y J UUCft

J / 1J.OO

\A/pnrl\7 \\A Qcnnprenr
VVCllUy lVld.UUllCl ftUll

105fS 00

Flatinv WKPmari
L'ciiiiiy vv i >*w iiiuii

f^lar:! friiprrprn

?4<S(S 67

3399.41

Shalin 7nlkifli

Olltlllll ^jUIIVIIII

2098.57

rVTik"fi TCmviiniPTTii

IVXIJVtl 1VU1 V LI 111^1111

3396.47

Tpnnpllp lVTillipati

1 I. 1 1 1 1 k. 1 1 k. 1V111I1_,U1

2331.67

Jeff Carter

3410.94

Shannon O'Keefe

2640.00

Michael Machuga

3455.33

Joy Esterson

1807.14

Ryan Shafer

2983.53

Adrienne Miller

1798.57

Mike Wolfe

2902.35

Brenda Mack

1833.33

Steve Jaros

2884.12

Olivia Sandham

1100.00

Chris Loschetter

3035.63

Amy Stolz

2500.00

Mike DeVaney

2681.76

Kelly Kulick

830.00

Ken Simard

2412.19

Source: Professional Bowlers Association, April 13, 2009.

A graph that is similar to an ogive is a graph of the empirical cumulative distribution function (CDF). The
primary difference between an ogive and an empirical CDF is that the empirical CDF looks like a set of
steps, as opposed to a set of slanted lines. The height of each step corresponds to the percentage of
observations that occur at a specific value. Longer (not higher) steps occur when there are bigger gaps
between observations.

a. Figures 2.25(a) and (b) contain the empirical CDFs of the earnings per event for the two tours
(men's and women's), in some order. In other words, one of these two figures is for the men's
tour, and the other is for the women's tour, but not in that order necessarily. Match the CDFs to
the respective tours. Give three reasons for your choices.

b. Both distributions are skewed to the right. Use the information about longer steps to explain
why the distributions are skewed to the right.

c. What are the approximate values of the CDFs corresponding to $3000 per tournament played and
$4000 per tournament played? Based on this information, what is the approximate percentage of
bowlers who earned between $3000 and $4000 per tournament played?
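
The step behavior of an empirical CDF described in this exercise can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `empirical_cdf` and the five data values are hypothetical and are not the bowling earnings.

```python
from collections import Counter

def empirical_cdf(data):
    """Return (value, cumulative percentage) pairs, one per distinct value."""
    n = len(data)
    counts = Counter(data)
    steps = []
    cumulative = 0
    for value in sorted(counts):
        # The rise at each value equals the share of observations at that value.
        cumulative += counts[value]
        steps.append((value, 100 * cumulative / n))
    return steps

# Hypothetical data, for illustration only
sample = [2, 3, 3, 5, 9]
for value, pct in empirical_cdf(sample):
    print(value, pct)
```

In this made-up sample the CDF stays flat at 80% from 5 until it reaches 9, which is the kind of longer step that a large gap between observations produces.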

2.95 Table 2.18 contains the differences in the obesity rates (called rate change in the table) between
1997 and 2007 for each of the 50 states and the District of Columbia. The obesity rate is the percentage
of people having a body mass index (BMI) of 30 or higher. Figure 2.26 contains a dotplot of these data.

a. Analyze the dotplot carefully. What value would you provide if asked to report a "typical" obesity
rate change? Why did you choose this value?

Figure 2.25(a)

Figure 2.25(b)

Figure 2.26 Dotplot of obesity rate changes (year 2007 minus year 1997). [Horizontal axis: Obesity Rate Change, marked from 6.0 to 13.2.]

Table 2.18 Difference in 2007 and 1997 Obesity Rates, by State

Rate

State

Change

State

Change

State

Change

12.1

5.6

9.5

7.8

in t

10.2

9.8

8.6

11).

/.y

6. 1

6.6

6.5

9.6

6.9

O 1

8.4

7.6

C 1

6.5

9. 1

i 1 c
11.5

5.6

A /TC

1 A

cn
SU

9.2

D.C.

7..i

8.4

IZA

7.5

7.2

9.4

1 3.0

MX?

U 1

6.6

7 S

VT
V 1

8.2

10.2

7.9

7.8

7.5

10.1

5.6

9.1

8.9

7.5

8.1

12.2

9.7

8.7

Source: Centers for Disease Control and Prevention, July 24, 2008.

b. What number do you feel most accurately represents the number of outliers in this data set: 0,
1, 3, 4, 6, 9, or 10? Explain your reasoning, including the identification of the observations, if
any, that you feel are outliers.

c. Would you classify this distribution as being skewed to the left, skewed to the right, or approxi-
mately symmetric? Explain.

d. The largest increase in the obesity rate during this period took place in Georgia (13.8), whereas
the smallest increase took place in Vermont (5.4). Explain why this information should not lead
you to conclude that Georgia had the highest obesity rate in 2007 and that Vermont had the
lowest obesity rate in 2007. (Note: The highest and lowest obesity rates in 2007 were in
Mississippi and Colorado, respectively.)

2.96 Figure 2.27 contains stacked dotplots of 2007 state obesity rates by different geographic regions —
Midwest, Northeast, South, and West.

Figure 2.27 2007 state obesity rates by geographic region. [Stacked dotplots for the Midwest, Northeast, South, and West; horizontal axis: 2007 Obesity Rate, marked from 19.8 to 30.6.]

a. Which region has the least variability (greatest consistency) of obesity rates? Which region has
the most variability (least consistency) of obesity rates? Justify your choices.

b. Which region tends to have the highest obesity rates? Which region tends to have the lowest
obesity rates? Justify your choices.

c. Are there any regions that have at least one obesity rate that could be considered an outlier? If
so, specify the region(s) and the observation(s).

2.97 CBS Sports had a Facebook page for the 2009 NCAA Men's Basketball Tournament including bracket
contests, discussion sites, and a variety of polls. One of the polls asked users to identify their most de-
spised teams. The following pie chart (Figure 2.28) gives a breakdown of the votes by the conference of
the most despised teams as of 10:53 EDT on March 16, 2009.

Figure 2.28 Pie chart of conference of the most despised NCAA men's basketball team. [Pie Chart of Conference; legible slice label: Big Ten.]

a. Are there any conferences that received more than 25% of the votes? If so, which conference(s)?
How can you tell?

b. Which two conferences appear to have the closest percentages of the votes?

c. A bar chart for the same data is presented in Figure 2.29. Comparing the bar chart to the pie
chart, match the conferences to the bars. In other words, explain which bar represents which
conference.

Figure 2.29 Bar chart of conference of the most despised NCAA men's basketball team. [Chart of Votes: vertical axis marked in steps of 10000 up to 90000; horizontal axis: Conference, with bars labeled a through h.]

Self-Review Test

1. Briefly explain the difference between ungrouped and grouped data and give one example of each
type.

2. The following table gives the frequency distribution of times (to the nearest hour) that 90 fans spent
waiting in line to buy tickets to a rock concert.

Waiting Time

(hours)

Frequency

0 to 6

7 to 13

14 to 20

21 to 27

28 to 34

Circle the correct answer in each of the following statements, which are based on this table.

a. The number of classes in the table is 5, 30, 90.

b. The class width is 6, 7, 34.

c. The midpoint of the third class is 16.5, 17, 17.5.

d. The lower boundary of the second class is 6.5, 7, 7.5.

e. The upper limit of the second class is 12.5, 13, 13.5.

f. The sample size is 5, 90, 11.

g. The relative frequency of the second class is .22, .41, .30.

3. Briefly explain and illustrate with the help of graphs a symmetric histogram, a histogram skewed to
the right, and a histogram skewed to the left.

4. Twenty elementary school children were asked if they live with both parents (B), father only (F),
mother only (M), or someone else (S). The responses of the children follow.

M B B M F S B M F M
B F B M M B B F B M

a. Construct a frequency distribution table.

b. Write the relative frequencies and percentages for all categories.

c. What percentage of the children in this sample live with their mothers only?

d. Draw a bar graph for the frequency distribution and a pie chart for the percentages.

5. A large Midwestern city has been chronically plagued by false fire alarms. The following data set
gives the number of false alarms set off each week for a 24-week period in this city.

10 4 8 7 3 7 10 2 6 12 11 8

1 6 5 13 9 7 5 1 14 5 15 3

a. Construct a frequency distribution table. Take 1 as the lower limit of the first class and 3 as the
width of each class.

b. Calculate the relative frequencies and percentages for all classes.

c. What percentage of these weeks had 9 or fewer false alarms?

d. Draw the frequency histogram and polygon.

6. Refer to the frequency distribution prepared in Problem 5. Prepare the cumulative percentage distri-
bution using that table. Draw an ogive for the cumulative percentage distribution.

7. Construct a stem-and-leaf display for the following data, which give the times (in minutes) 24 cus-
tomers spent waiting to speak to a customer service representative when they called about problems
with their Internet service provider.

12 15 7 29 32 16 10 14 17 8 19 21

4 14 22 25 18 6 22 16 13 16 12 20

8. Consider this stem-and-leaf display:

2 4

1 3

1 9

Write the data set that was used to construct this display.

9. Make a dotplot for the data given in Problem 5.

Mini-Projects

■ MINI-PROJECT 2-1

Using the data you gathered for the mini-project in Chapter 1, prepare a summary of that data set that in-
cludes the following.

a. Prepare an appropriate type of frequency distribution table for one of the quantitative variables
and then compute relative frequencies and cumulative relative frequencies.

b. Create a histogram, a stem-and-leaf display, and a dotplot of the data. Comment on any symme-
try or skewness and on the presence of clusters and any potential outliers.

c. Make stacked dotplots of the same variable (as in parts a and b) based on the values of one of
your categorical variables. For example, if your quantitative variable is GPAs of students, your cat-
egorical variable could be gender. Comment on the similarities and differences between the dis-
tributions for the different values of your categorical variable.

■ MINI-PROJECT 2-2

Choose 15 of each of two types of magazines (news, sports, fitness, entertainment, and so on) and record
the percentage of pages that contain at least one advertisement. Using these percentages and the types of
magazines, write a brief report that covers the following:

a. Prepare an appropriate type of frequency distribution table for the quantitative variable and then
compute relative frequencies and cumulative relative frequencies.

b. Create a histogram, a stem-and-leaf plot, and a dotplot of all of the data. Comment on any sym-
metry or skewness, as well as the presence of clusters and any potential outliers.

c. Make stacked dotplots of the same variable for each of the two types of magazines. Comment on
the similarities and differences between the distributions for the two types of magazines.

DECIDE FOR YOURSELF

Deciding About Statistical Properties

Look around you. Graphs are everywhere. Business reports, news-
papers, magazines, and so forth are all loaded with graphs.
Unfortunately, some people feel that the primary purpose of
graphs is to provide a break from the humdrum text. Executive sum-
maries will often contain graphs so that CEOs and executive vice
presidents need only to glance at these graphs to assume that they
understand everything without reading more than a paragraph or so of
the report. In reality, the usefulness of graphs is somewhere between
the fluff of the popular press and the quick answer of the boardroom.

Here you are asked to interpret some graphs, primarily by using
them to compare distributions of a variable. As we will discuss in
Chapter 3, some of our concerns have to do with the location of the
center of a distribution and the variability or spread of a distribution.
We can use graphs to compare the centers and variability of two or
more distributions.

In practice, the graphs are made using statistical software, so it
is important to recognize that computer software is programmed to
use the same format for each graph of a specific type, unless you tell
the software to do differently. For example, consider the two his-
tograms in Figures 2.30 and 2.31 that are drawn for two different
data sets.

1. Examine the two graphs of Figures 2.30 and 2.31.

2. Explain what is meant by the statement "the shapes of the two
distributions are the same."

3. Does the fact that the shapes of the two distributions are the
same imply that the centers of the two distributions are the same?
Why or why not? Explain.

4. Does the fact that the shapes of the two distributions are the
same imply that the spreads of the two distributions are the same?
Why or why not? Explain.

5. It turns out that the same variable was represented in the two
graphs but with different units of measurement. Can you figure out
the units?

Figure 2.30 Histogram of data temp 1.

Figure 2.31 Histogram of data temp 2.

Another situation that is important to compare is when two
graphs cover a similar range but have different shapes, such as the
histograms in Figures 2.32 and 2.33.

1. Examine the two histograms of Figures 2.32 and 2.33.

2. These two distributions have the same center but do not have
the same spread. Decide which distribution has the larger spread and
explain the reasoning behind your decision.

Answer all the above questions again after reading Chapter 3.

Figure 2.32 Histogram of example 2a.

Figure 2.33 Histogram of example 2b.

TECHNOLOGY INSTRUCTION

Screen 2.1 [Calculator STAT PLOT setup screen showing the On/Off, Type, Xlist, and Freq settings.]

Organizing Data

1. To create a frequency histogram for a list of data, press STAT PLOT, which you access
by pressing 2nd > Y=. The Y= key is located at the top left of the calculator buttons.

2. Make sure that only one plot is turned on. If more than one plot is turned on, you can turn
off the unwanted plots by using the following steps. Press the number corresponding to the
plot you wish to turn off. A screen similar to Screen 2.1 will appear. Use the arrow keys
to move the cursor to the Off button, then press ENTER. Now use the arrow keys to move
to the row with Plot1, Plot2, and Plot3. If there is another plot that you need to turn off,
select that plot by moving the cursor to that plot, pressing ENTER, and repeating the previous
procedure. If not, move the cursor to the plot you wish to use and press ENTER.

3. In the Type rows, use the right arrow to move to the third column in the first row that
looks like a histogram, and press ENTER. Move to Xlist to enter the name of the list where
the data are located. Press 2nd > STAT, then use the up and down arrows to move through
the list names until you find the list you want to use. Press ENTER. Leave the Freq setting
at 1. (Note: If you are using one of the lists named L1, L2, L3, L4, L5, or L6, you
can enter the list name by pressing 2nd followed by one of the numbers 1 through 6, as
they correspond to the list names L1 through L6.)

4. To see the graph, select ZOOM > 9 (the ZOOMSTAT function), where ZOOM is the
third key in the top row. This sets the window settings to display your graph.

5. If you would like to change the class width and/or the starting point of the first inter-
val, select WINDOW (see Screen 2.2). To change the class width, change the value of
Xscl to the desired width. To change the starting point of the first interval, change the
value of Xmin to the desired point. Press GRAPH, which is the fifth button in the top
row. (Note: After making either or both of these changes, you may need to change the
values of Xmax and Ymax to see the entire graph. The difference between Xmax and
Xmin should be a multiple of Xscl. As an example, if Xmin = 5 and Xscl = 10 and
the largest data point is 93, then Xmax should be set to 95 because 95 - 5 = 90,
which is a multiple of 10, and 95 is larger than the largest data point. The purpose of
changing Ymax is to be able to see the tops of the bars of the histogram. If the bars
run off the top of the calculator screen, increase Ymax, and press GRAPH.)

6. If you would like to see the interval endpoints and the number of observations in each class
(which is given by the height of the corresponding bar), press TRACE, then use the left
and right arrows to move from one bar to the next. When you are done, press CLEAR.
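
The rule in step 5 for choosing Xmax (Xmax minus Xmin is a multiple of Xscl, and Xmax exceeds the largest data point) can be checked with a short calculation. The Python sketch below is only an illustration; the function name `suggest_xmax` is hypothetical and is not a calculator command.

```python
import math

def suggest_xmax(xmin, xscl, largest):
    """Smallest value xmin + k * xscl (k a whole number) that is larger than the largest data point."""
    k = math.floor((largest - xmin) / xscl) + 1
    return xmin + k * xscl

# Values from the note in step 5: Xmin = 5, Xscl = 10, largest data point 93
print(suggest_xmax(5, 10, 93))  # 95
```

The sketch always returns a value strictly larger than the largest data point, so a data point that falls exactly on a class boundary still gets a bar of its own.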

The functions for creating many common graphs are listed in the pulldown menu Graph.
The following instructions will demonstrate how to use Minitab to create two types of graphs
for categorical variables — a bar chart and a pie chart — and three types of graphs for quantita-
tive variables — a frequency histogram, a stem-and-leaf display, and a dotplot.

Bar Chart

1. If you have raw categorical data entered in a column (such as C1), select Graph > Bar
Chart. In the resulting dialog box, select Bars Represent: Counts of unique values and
Simple. Click OK. In the new dialog box, type C1 in the box below Categorical Variables
and click OK.

2. If you have categorical data in a frequency table, with the categories entered in C1 and the
frequencies in C2, select Graph > Bar Chart. In the resulting dialog box, select Bars
Represent: Values from a table and Simple. Click OK. In the new dialog box, type C2
in the box below Graph variables and C1 in the box below Categorical Variable and
click OK.

Pie Chart

1. If you have raw categorical data entered in C1, select Graph > Pie Chart. In the resulting
dialog box, select Chart raw data, type C1 in the box below Categorical Variables, and
click OK.

2. If you have categorical data in a frequency table, with the categories entered in C1 and the
frequencies in C2, select Graph > Pie Chart. In the resulting dialog box, select Chart
values from a table, type C2 in the box below Summary variables and C1 in the box
below Categorical Variable, and click OK.

Screen 2.2 [Calculator WINDOW settings: Xmin=5, Xmax=95, Xscl=10, Ymin=-2.16483, Ymax=15, Yscl=1, Xres=1.]

Frequency Histogram

For a quantitative data set entered in C1, select Graph > Histogram, select Simple, and click
OK. In the resulting dialog box, type C1 in the box below Graph Variables (see Screen 2.3) and
click OK. Minitab will produce a separate window that contains the histogram (see Screen 2.4).

Screen 2.3 [Minitab "Histogram - Simple" dialog box with a Graph variables box and the buttons Scale, Labels, Data View, Multiple Graphs, Data Options, Help, and Cancel.]

Screen 2.4

Stem-and-Leaf Display: C1

Stem-and-leaf of C1   N = 18
Leaf Unit = 1.0

 3   0   122
 9   0   556689
 9   1   133
 6   1   55789
 1   2   4

Screen 2.5

Stem-and-Leaf Display

For a quantitative data set entered in C1, select Graph > Stem-and-Leaf, type C1 in
the box below Graph Variables, and click OK. The display will appear in the Session
window (see Screen 2.5).
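
For readers without Minitab, the idea behind a stem-and-leaf display can be sketched in Python. This is a minimal illustration, not Minitab's algorithm: the function name `stem_and_leaf` and the data are hypothetical, the values are assumed to be non-negative, and each stem gets a single row (Minitab may split a stem across two rows and also prints a depth column).

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return stem-and-leaf rows as strings; each stem covers ten leaf units."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # drop anything smaller than the leaf unit
        rows[scaled // 10].append(scaled % 10)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep empty stems in the display
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical data, for illustration only
for line in stem_and_leaf([4, 7, 12, 15, 15, 21, 33]):
    print(line)
```

With `leaf_unit=100`, the same function would treat a value such as 830 as stem 0 and leaf 8, which matches how a leaf unit of 100 is read.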

Dotplot

For a quantitative data set entered in C1, select Graph > Dotplot, select the appropriate
dotplot from the choices, and click OK. In the resulting dialog box, type C1
in the box below Graph Variables and click OK. The dotplot will appear in a new
window.

1. To create a frequency distribution for a range of numerical data in
Excel, decide how many categories you will have. Choose class boundaries
between the categories so that you have one fewer boundary than
classes. Type the class boundaries into Excel.

2. Select where you want the class frequencies to appear, and select a
range of one more cell than the number of boundaries you have.

3. Type =frequency(.

4. Select the range of cells of numerical data, and then type a comma.

5. Select the range of class boundaries, and type a right parenthesis (see
Screen 2.6).

Screen 2.6 [Excel worksheet with columns Data, Boundaries, and Frequencies; a =FREQUENCY formula is being entered with the data range a3:a20.]
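
The counting that these Excel steps set up can be mirrored in Python. The sketch below assumes the usual behavior of Excel's FREQUENCY function, in which a value equal to a boundary is counted in the class that ends at that boundary; the function name `frequency`, the data values, and the boundaries are hypothetical choices for illustration.

```python
from bisect import bisect_left

def frequency(data, boundaries):
    """Count values per class: <= b1, (b1, b2], ..., > last boundary."""
    boundaries = sorted(boundaries)
    counts = [0] * (len(boundaries) + 1)   # one more class than boundaries
    for x in data:
        # bisect_left gives the index of the first boundary that is >= x
        counts[bisect_left(boundaries, x)] += 1
    return counts

# Hypothetical data and boundaries, for illustration only
data = [1.1, 2.1, 5.2, 6.3, 6.6, 8.4, 9.3, 13.4, 15.1, 15.7, 18.2, 19.3]
print(frequency(data, [5, 10, 15]))  # [2, 5, 1, 4]
```

Three boundaries give four classes, which is why step 2 asks for one more cell than the number of boundaries.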

TECHNOLOGY ASSIGNMENTS

TA2.1 Construct a bar graph and a pie chart for the frequency distribution prepared in Exercise 2.5.

TA2.2 Construct a bar graph and a pie chart for the frequency distribution prepared in Exercise 2.6.

TA2.3 Refer to Data Set V that accompanies this text (see Preface and Appendix B) on the times taken
to run the Manchester Road Race for a sample of 500 participants. From that data set, select the 6th value,
and then select every 10th value after that (i.e., select the 6th, 16th, 26th, 36th, . . . values). This subsample
will give you 50 measurements. (Such a sample selected from a population is called a systematic random
sample.) Construct a histogram for these data. Let the software you use decide on classes and class
limits.

TA2.4 Refer to Data Set I that accompanies this text on the prices of various products in different cities
across the country. Select a subsample of 60 from the column that contains information on pizza prices
and then construct a histogram for these data.

TA2.5 Construct a histogram for the data from Exercise 2.20 on the numbers of computer keyboards
assembled. Use the classes given in that exercise. Use the midpoints to mark the horizontal axis in the
histogram.

TA2.6 Prepare a stem-and-leaf display for the data given in Exercise 2.48.
TA2.7 Prepare a stem-and-leaf display for the data of Exercise 2.53.
TA2.8 Prepare a bar graph for the frequency distribution obtained in Exercise 2.28.
TA2.9 Prepare a bar graph for the frequency distribution obtained in Exercise 2.29.
TA2.10 Make a pie chart for the frequency distribution obtained in Exercise 2.19.
TA2.11 Make a pie chart for the frequency distribution obtained in Exercise 2.29.
TA2.12 Make a dotplot for the data of Exercise 2.64.
TA2.13 Make a dotplot for the data of Exercise 2.65.
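
The selection rule in TA2.3 (take the 6th value, then every 10th value after it) can be written as a single slice in Python. This is a sketch only: the function name `systematic_sample` is hypothetical, and the list of 500 numbers is a placeholder standing in for Data Set V, not the actual race times.

```python
def systematic_sample(values, start, step):
    """Select the start-th value (counting from 1) and every step-th value after it."""
    return values[start - 1::step]

race_times = list(range(1, 501))   # placeholder for 500 values; not the real data
subsample = systematic_sample(race_times, start=6, step=10)
print(len(subsample))    # 50
print(subsample[:4])     # [6, 16, 26, 36]
```

Because the placeholder values are simply 1 through 500, the printed values show which positions are selected.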

During the 2008 season, among all baseball teams, the New York Yankees drew the highest av-
erage of spectators to the games. (See Case Study 3-1). Despite baseball being "America's National
Pastime," attendance at Major League Baseball games varies from team to team. What may attract
fans to baseball fields? Is it the number of championships won, market size, fan loyalty, or just love
for the team or game?

In Chapter 2 we discussed how to summarize data using different methods and to display data using
graphs. Graphs are one important component of statistics; however, it is also important to numerically
describe the main characteristics of a data set. The numerical summary measures, such as the ones
that identify the center and spread of a distribution, identify many important features of a distribution.
For example, the techniques learned in Chapter 2 can help us graph data on family incomes. However,
if we want to know the income of a "typical" family (given by the center of the distribution), the spread
of the distribution of incomes, or the relative position of a family with a particular income, the numeri-
cal summary measures can provide more detailed information (see Figure 3.1). The measures that
we discuss in this chapter include measures of (1) central tendency, (2) dispersion (or spread), and
(3) position.

3.1 Measures of Central
Tendency for Ungrouped
Data

Case Study 3-1 Average
Attendance at Baseball
Games

Case Study 3-2 The Gender
Pay Gap

3.2 Measures of Dispersion
for Ungrouped Data

3.3 Mean, Variance, and
Standard Deviation for
Grouped Data

3.4 Use of Standard
Deviation

Case Study 3-3 Here Comes
the SD

3.5 Measures of Position

3.6 Box-and-Whisker Plot

80 Chapter 3 Numerical Descriptive Measures

Figure 3.1 [Income distribution curve indicating the spread, the center ($76,260), and the position of a particular family on the Income axis.]

3.1 Measures of Central Tendency
for Ungrouped Data

We often represent a data set by numerical summary measures, usually called the typical val-
ues. A measure of central tendency gives the center of a histogram or a frequency distribu-
tion curve. This section discusses three different measures of central tendency: the mean, the
median, and the mode; however, a few other measures of central tendency, such as the trimmed
mean, the weighted mean, and the geometric mean, are explained in exercises following this
section. We will learn how to calculate each of these measures for ungrouped data. Recall from
Chapter 2 that the data that give information on each member of the population or sample
individually are called ungrouped data, whereas grouped data are presented in the form of a
frequency distribution table.

3.1.1 Mean

The mean, also called the arithmetic mean, is the most frequently used measure of central
tendency. This book will use the words mean and average synonymously. For ungrouped
data, the mean is obtained by dividing the sum of all values by the number of values in the
data set:

\[ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}} \]

The mean calculated for sample data is denoted by \( \bar{x} \) (read as "x bar"), and the mean
calculated for population data is denoted by \( \mu \) (Greek letter mu). We know from the discussion
in Chapter 2 that the number of values in a data set is denoted by \( n \) for a sample and by \( N \)
for a population. In Chapter 1, we learned that a variable is denoted by \( x \), and the sum of all
values of \( x \) is denoted by \( \sum x \). Using these notations, we can write the following formulas for
the mean.

Calculating Mean for Ungrouped Data The mean for ungrouped data is obtained by
dividing the sum of all values by the number of values in the data set. Thus,

Mean for population data: \( \mu = \frac{\sum x}{N} \)

Mean for sample data: \( \bar{x} = \frac{\sum x}{n} \)

where \( \sum x \) is the sum of all values, \( N \) is the population size, \( n \) is the sample size, \( \mu \) is the
population mean, and \( \bar{x} \) is the sample mean.

Calculating the sample mean
for ungrouped data.

■ EXAMPLE 3-1

Table 3.1 lists the total sales (rounded to billions of dollars) of six U.S. companies for 2008.

Table 3.1 2008 Sales of Six U.S. Companies

Company                   Total Sales (billions of dollars)
General Motors            149
Wal-Mart Stores           406
General Electric          183
Citigroup                 107
Exxon Mobil               426
Verizon Communication      97

Find the 2008 mean sales for these six companies.

Solution The variable in this example is the 2008 total sales for a company. Let us denote
this variable by \( x \). Then, the six values of \( x \) are

\[ x_1 = 149, \quad x_2 = 406, \quad x_3 = 183, \quad x_4 = 107, \quad x_5 = 426, \quad \text{and} \quad x_6 = 97 \]

where \( x_1 = 149 \) represents the 2008 total sales of General Motors, \( x_2 = 406 \) represents the
2008 total sales of Wal-Mart Stores, and so on. The sum of the 2008 sales for these six
companies is

\[ \sum x = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = 149 + 406 + 183 + 107 + 426 + 97 = 1368 \]

Note that the given data include only six companies. Hence, they represent a sample. Because
the given data set contains six companies, \( n = 6 \). Substituting the values of \( \sum x \) and \( n \) in the
sample formula, we obtain the mean 2008 sales of the six companies:

\[ \bar{x} = \frac{\sum x}{n} = \frac{1368}{6} = 228 = \$228 \text{ billion} \]

Thus, the mean 2008 sales of these six companies was 228, or $228 billion. ■

■ EXAMPLE 3-2

The following are the ages (in years) of all eight employees of a small company:

53 32 61 27 39 44 49 57

Find the mean age of these employees.

Solution Because the given data set includes all eight employees of the company, it represents
the population. Hence, \( N = 8 \). We have

\[ \sum x = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 = 362 \]

The population mean is

\[ \mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \text{ years} \]

Thus, the mean age of all eight employees of this company is 45.25 years, or 45 years and
3 months.

Calculating the population
mean for ungrouped data.
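
The arithmetic in Examples 3-1 and 3-2 can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `mean` is our own, and the same formula serves for both the sample and the population because only the notation differs.

```python
def mean(values):
    """Arithmetic mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)

sales = [149, 406, 183, 107, 426, 97]      # Example 3-1: sample, n = 6 (billions of dollars)
ages = [53, 32, 61, 27, 39, 44, 49, 57]    # Example 3-2: population, N = 8 (years)

print(sum(sales), mean(sales))   # 1368 228.0
print(sum(ages), mean(ages))     # 362 45.25
```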

Reconsider Example 3-2. If we take a sample of three employees from this company and
calculate the mean age of those three employees, this mean will be denoted by \( \bar{x} \). Suppose the
three values included in the sample are 32, 39, and 57. Then, the mean age for this sample is

\[ \bar{x} = \frac{32 + 39 + 57}{3} = 42.67 \text{ years} \]

If we take a second sample of three employees of this company, the value of \( \bar{x} \) will (most likely)
be different. Suppose the second sample includes the values 53, 27, and 44. Then, the mean age
for this sample is

\[ \bar{x} = \frac{53 + 27 + 44}{3} = 41.33 \text{ years} \]

Consequently, we can state that the value of the population mean \( \mu \) is constant. However, the
value of the sample mean \( \bar{x} \) varies from sample to sample. The value of \( \bar{x} \) for a particular sample
depends on what values of the population are included in that sample.
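
To see how the sample mean changes from sample to sample, the two samples above can be recomputed and then compared with every possible sample of three employees. The sketch below uses Python's standard `statistics` and `itertools` modules; listing all possible samples is our own illustration and is not part of the example.

```python
from itertools import combinations
from statistics import mean

ages = [53, 32, 61, 27, 39, 44, 49, 57]   # population from Example 3-2

print(round(mean([32, 39, 57]), 2))   # 42.67
print(round(mean([53, 27, 44]), 2))   # 41.33

# Every possible sample of three employees and its sample mean
sample_means = [mean(s) for s in combinations(ages, 3)]
print(len(sample_means))              # 56
print(min(sample_means), max(sample_means))
```

The population mean stays fixed at 45.25, while the 56 sample means spread out around it.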

Sometimes a data set may contain a few very small or a few very large values. As mentioned
in Chapter 2 on page 58, such values are called outliers or extreme values.

A major shortcoming of the mean as a measure of central tendency is that it is very sensitive
to outliers. Example 3-3 illustrates this point.

Illustrating the effect of an
outlier on the mean.

■ EXAMPLE 3-3

Table 3.2 lists the total philanthropic givings (in millions of dollars) by six companies during 2007.

Table 3.2 Philanthropic Givings of Six Companies During 2007

Corporation    Money Given in 2007 (millions of dollars)
CVS             22.4
Best Buy        31.8
Staples         19.8
Walgreen         9.0
Lowe's          27.5
Wal-Mart       337.9

Notice that the charitable contributions made by Wal-Mart are very large compared to those
of other companies. Hence, it is an outlier. Show how the inclusion of this outlier affects the
value of the mean.

Solution If we do not include the charitable givings of Wal-Mart (the outlier), the mean of
the charitable contributions of the five companies is

\[ \text{Mean} = \frac{22.4 + 31.8 + 19.8 + 9.0 + 27.5}{5} = \frac{110.5}{5} = \$22.1 \text{ million} \]

Now, to see the impact of the outlier on the value of the mean, we include the contributions
of Wal-Mart and find the mean contributions of the six companies. This mean is

\[ \text{Mean} = \frac{22.4 + 31.8 + 19.8 + 9.0 + 27.5 + 337.9}{6} = \frac{448.4}{6} = \$74.73 \text{ million} \]

Thus, including the contributions of Wal-Mart causes more than a threefold increase in the
value of the mean, which changes from $22.1 million to $74.73 million. ■
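
The effect of the outlier can be verified numerically. The following Python sketch recomputes both means from Table 3.2; the variable names are our own.

```python
givings = {"CVS": 22.4, "Best Buy": 31.8, "Staples": 19.8,
           "Walgreen": 9.0, "Lowe's": 27.5, "Wal-Mart": 337.9}   # millions of dollars

# Mean of the five companies, leaving out the outlier
without_outlier = [v for name, v in givings.items() if name != "Wal-Mart"]
mean_without = sum(without_outlier) / len(without_outlier)

# Mean of all six companies, outlier included
mean_with = sum(givings.values()) / len(givings)

print(round(mean_without, 2))               # 22.1
print(round(mean_with, 2))                  # 74.73
print(round(mean_with / mean_without, 2))   # 3.38
```

The ratio of about 3.38 is the "more than a threefold increase" noted in the solution.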

Case Study 3-1 AVERAGE ATTENDANCE AT BASEBALL GAMES

USA TODAY Snapshots®

Packing them in

The Yankees and Mets led the majors in attendance
last year and figure to draw big crowds again this
season while opening new ballparks. Leading average
attendance figures in 2008:

New York Yankees        52,585
New York Mets           51,165
Los Angeles Dodgers     46,059
St. Louis Cardinals     42,382
Philadelphia Phillies   42,254

Source: Elias Sports Bureau

By [names illegible], USA TODAY

Source: USA TODAY, April 2, 2009. Reproduced with permission.

The accompanying chart shows five of the Major League Baseball teams that had the highest average
attendance during the 2008 season. According to the information given in the chart, the highest average
[...] per game.

The preceding example should encourage us to be cautious. We should remember that the
mean is not always the best measure of central tendency because it is heavily influenced by out-
liers. Sometimes other measures of central tendency give a more accurate impression of a data
set. For example, when a data set has outliers, instead of using the mean, we can use either the
trimmed mean (defined in Exercise 3.33) or the median (to be discussed next) as a measure of
central tendency.

3.1.2 Median

Another important measure of central tendency is the median. It is defined as follows.

Definition

Median The median is the value of the middle term in a data set that has been ranked in increas-
ing order.

As is obvious from the definition of the median, it divides a ranked data set into two equal
parts. The calculation of the median consists of the following two steps:

1. Rank the data set in increasing order.

2. Find the middle term. The value of this term is the median. 1

Note that if the number of observations in a data set is odd, then the median is given by
the value of the middle term in the ranked data. However, if the number of observations is even,
then the median is given by the average of the values of the two middle terms.
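
The two-step procedure and the odd/even rule can be sketched as a short Python function. This is an illustration under our own naming (`median` is defined here rather than imported); the first data set is hypothetical, and the second reuses the eight ages from Example 3-2.

```python
def median(values):
    """Median of ungrouped data: rank the values, then take the middle term."""
    ranked = sorted(values)            # Step 1: rank the data in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                     # odd count: the single middle term
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2   # even count: average the two middle terms

print(median([5, 1, 9]))                           # 5 (hypothetical data)
print(median([53, 32, 61, 27, 39, 44, 49, 57]))    # 46.5 (ages from Example 3-2)
```

For the eight ages, the two middle terms of the ranked data are 44 and 49, so the median is their average, 46.5 years.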

1 The value of the middle term in a data set ranked in decreasing order will also give the value of the median.

Calculating the median
for ungrouped data: odd
number of data values.

■ EXAMPLE 3-4

The following data give the prices (in thousands of dollars) of seven houses selected from all
houses sold last month in a city.

312 257 421 289 526 374 497

Find the median.

Solution First, we rank the given data in increasing order as follows:

257 289 312 374 421 497 526

Since there are seven homes in this data set and the middle term is the fourth term, the me-
dian is given by the value of the fourth term in the ranked data.

257 289 312 374 421 497 526
             ↑
          Median

Thus, the median price of a house is 374, or $374,000. ■

Calculating the median
for ungrouped data: even
number of data values.

■ EXAMPLE 3-5

Table 3.3 gives the 2008 profits (rounded to billions of dollars) of 12 companies selected from
all over the world.

Table 3.3 Profits of 12 Companies for 2008

Company          2008 Profits (billions of dollars)
Merck & Co
IBM
Unilever
Microsoft
Petrobras
Exxon Mobil
Lukoil
AT&T
Nestle
Vodafone
Deutsche Bank
China Mobile

Find the median for these data.

Solution First we rank the given profits as follows:

7 8 9 10 11 12 13 13 14 17 17 45

There are 12 values in this data set. Because there is an even number of values in the data set,
the median is given by the average of the two middle values. The two middle values are the
sixth and seventh in the foregoing list of data, and these two values are 12 and 13. The median,
which is given by the average of these two values, is calculated as follows.

7 8 9 10 11 12 13 13 14 17 17 45
              ↑
           Median

\[ \text{Median} = \frac{12 + 13}{2} = \frac{25}{2} = 12.5 = \$12.5 \text{ billion} \]

Thus, the median profit of these 12 companies is $12.5 billion. ■

USA TODAY Snapshots®

The gender pay gap

Among workers 25 or older who worked full time, year round in 2007, women made an average of
77 cents for every dollar earned by men.

The accompanying chart shows the median earnings of men and women aged 25 years and older for the
year 2007. These numbers are based on Census Bureau surveys done in 2007 but released in January 2009.
The median earnings of women 25 years or older were $35,759, whereas those of men 25 years or older
were $46,788.

Source: USA TODAY, February 9, 2009. Reproduced with permission.

The median gives the center of a histogram, with half of the data values to the left of the
median and half to the right of the median. The advantage of using the median as a measure of
central tendency is that it is not influenced by outliers. Consequently, the median is preferred
over the mean as a measure of central tendency for data sets that contain outliers.
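
As a rough check of this claim, the sketch below recomputes the mean and median of the ranked profits from Example 3-5 with and without the largest value. Treating 45 as the outlier is an assumption made here for illustration; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Ranked 2008 profits (billions of dollars) from Example 3-5
profits = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]

# Assumption for illustration: treat the largest value, 45, as the outlier
without_outlier = [x for x in profits if x != 45]

mean_all, median_all = mean(profits), median(profits)
mean_trimmed, median_trimmed = mean(without_outlier), median(without_outlier)

print(round(mean_all, 2), median_all)          # mean and median with the outlier
print(round(mean_trimmed, 2), median_trimmed)  # mean and median without it
```

Dropping the one large value moves the mean by almost 3 (billion dollars) but moves the median by only 0.5, which is the sense in which the median is not influenced by outliers.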

3.1.3 Mode

Mode is a French word that means fashion — an item that is most popular or common. In sta-
tistics, the mode represents the most common value in a data set.

Definition

Mode The mode is the value that occurs with the highest frequency in a data set.

Calculating the mode for
ungrouped data.

■ EXAMPLE 3-6

The following data give the speeds (in miles per hour) of eight cars that were stopped on
I-95 for speeding violations.

77 82 74 81 79 84 74 78

Find the mode.

Solution In this data set, 74 occurs twice, and each of the remaining values occurs only
once. Because 74 occurs with the highest frequency, it is the mode. Therefore,

Mode = 74 miles per hour


A major shortcoming of the mode is that a data set may have none or may have more than one
mode, whereas it will have only one mean and only one median. For instance, a data set with each
value occurring only once has no mode. A data set with only one value occurring with the high-
est frequency has only one mode. The data set in this case is called unimodal. A data set with
two values that occur with the same (highest) frequency has two modes. The distribution, in this
case, is said to be bimodal. If more than two values in a data set occur with the same (highest)
frequency, then the data set contains more than two modes and it is said to be multimodal.
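
The cases described above (no mode, one mode, two or more modes) can be sketched with a small helper. The function name `modes` and the sample lists are hypothetical; following the text, a data set in which every value occurs only once is reported as having no mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    An empty list is returned when every value occurs only once,
    matching the convention that such a data set has no mode.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


no_mode = modes([1, 2, 3, 4])          # every value occurs once
unimodal = modes([1, 2, 2, 3])         # one value with the highest frequency
bimodal = modes([1, 1, 2, 2, 3])       # two values tie for the highest frequency
qualitative = modes(["red", "blue", "red"])   # also works for categories
print(no_mode, unimodal, bimodal, qualitative)
```

Returning a list rather than a single number reflects the shortcoming noted above: unlike the mean and the median, the mode need not be unique and need not exist.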

Data set with no mode.

■ EXAMPLE 3-7

Last year's incomes of five randomly selected families were $76,150, $95,750, $124,985,
$87,490, and $53,740. Find the mode.

Solution Because each value in this data set occurs only once, this data set contains no
mode. ■

Data set with two modes.

■ EXAMPLE 3-8

Refer to the data on 2008 profits of 12 companies given in Table 3.3 of Example 3-5. Find
the mode for these data.

Solution In the data given in Example 3-5, each of the two values 13 and 17 occurs twice,
and each of the remaining values occurs only once. Therefore, that data set has two modes:
$13 billion and $17 billion. ■

Data set with three modes.

■ EXAMPLE 3-9

The ages of 10 randomly selected students from a class are 21, 19, 27, 22, 29, 19, 25, 21, 22,
and 30 years, respectively. Find the mode.

Solution This data set has three modes: 19, 21, and 22. Each of these three values occurs
with a (highest) frequency of 2. ■

One advantage of the mode is that it can be calculated for both kinds of data — quantitative
and qualitative — whereas the mean and median can be calculated for only quantitative data.

Finding the mode for
qualitative data.

■ EXAMPLE 3-10

The statuses of five students who are members of the student senate at a college are senior,
sophomore, senior, junior, and senior, respectively. Find the mode.

Solution Because senior occurs more frequently than the other categories, it is the mode
for this data set. We cannot calculate the mean and median for this data set. ■

To sum up, we cannot say for sure which of the three measures of central tendency is a better
measure overall. Each of them may be better under different situations. Probably the mean
is the most-used measure of central tendency, followed by the median. The mean has the
advantage that its calculation includes each value of the data set. The median is a better measure
when a data set includes outliers. The mode is simple to locate, but it is not of much use in
practical applications.

3.1.4 Relationships Among the Mean, Median, and Mode

As discussed in Chapter 2, two of the many shapes that a histogram or a frequency distribution
curve can assume are symmetric and skewed. This section describes the relationships among
the mean, median, and mode for three such histograms and frequency distribution curves.


Knowing the values of the mean, median, and mode can give us some idea about the shape of
a frequency distribution curve.

1. For a symmetric histogram and frequency distribution curve with one peak (see Figure 3.2),
the values of the mean, median, and mode are identical, and they lie at the center of the
distribution.

Figure 3.2 Mean, median, and mode for a symmetric histogram and frequency distribution curve.
[Horizontal axis: Variable. The point Mean = median = mode is marked at the center.]

2. For a histogram and a frequency distribution curve skewed to the right (see Figure 3.3), the
value of the mean is the largest, that of the mode is the smallest, and the value of the me-
dian lies between these two. (Notice that the mode always occurs at the peak point.) The
value of the mean is the largest in this case because it is sensitive to outliers that occur in
the right tail. These outliers pull the mean to the right.

Figure 3.3 Mean, median, and mode for a histogram and frequency distribution curve skewed to the right.
[Horizontal axis: Variable. Marked from left to right: Mode, Median, Mean.]

3. If a histogram and a frequency distribution curve are skewed to the left (see Figure 3.4),
the value of the mean is the smallest and that of the mode is the largest, with the value of
the median lying between these two. In this case, the outliers in the left tail pull the mean
to the left.
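
The ordering of the three measures in the skewed cases can be checked numerically. The sketch below uses two small made-up data sets (they are not from the text): one with a long right tail and its mirror image with a long left tail.

```python
from statistics import mean, median, mode

# Hypothetical data: most values are small, with a long tail on the right
right_skewed = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]
# Mirror image of the same data: long tail on the left
left_skewed = [24 - x for x in right_skewed]

# Listed in the order mode, median, mean (expected to be increasing)
right_summary = (mode(right_skewed), median(right_skewed), mean(right_skewed))
# Listed in the order mean, median, mode (expected to be increasing)
left_summary = (mean(left_skewed), median(left_skewed), mode(left_skewed))

print(right_summary)
print(left_summary)
```

These two lists happen to follow the pattern exactly; with real data the ordering is a guide to the shape of the distribution rather than a guarantee.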

Figure 3.4 Mean, median, and mode for a histogram and frequency distribution curve skewed to the left.
[Horizontal axis: Variable. Marked from left to right: Mean, Median, Mode.]

CONCEPTS AND PROCEDURES

3.1 Explain how the value of the median is determined for a data set that contains an odd number of ob-
servations and for a data set that contains an even number of observations.

3.2 Briefly explain the meaning of an outlier. Is the mean or the median a better measure of central ten-
dency for a data set that contains outliers? Illustrate with the help of an example.

3.3 Using an example, show how outliers can affect the value of the mean.


3.4 Which of the three measures of central tendency (the mean, the median, and the mode) can be cal-
culated for quantitative data only, and which can be calculated for both quantitative and qualitative data?
Illustrate with examples.

3.5 Which of the three measures of central tendency (the mean, the median, and the mode) can assume
more than one value for a data set? Give an example of a data set for which this summary measure as-
sumes more than one value.

3.6 Is it possible for a (quantitative) data set to have no mean, no median, or no mode? Give an exam-
ple of a data set for which this summary measure does not exist.

3.7 Explain the relationships among the mean, median, and mode for symmetric and skewed histograms.
Illustrate these relationships with graphs.

3.8 Prices of cars have a distribution that is skewed to the right with outliers in the right tail. Which of
the measures of central tendency is the best to summarize this data set? Explain.

3.9 The following data set belongs to a population:

5 -7 2 -9 16 10 7
Calculate the mean, median, and mode.

3.10 The following data set belongs to a sample:

14 18 -10 8 8 -16

Calculate the mean, median, and mode.

■ APPLICATIONS

3.11 The following table gives the standard deductions and personal exemptions for persons filing with
"single" status on their 2009 state income taxes in a random sample of 10 states. Calculate the mean and
median for the data on standard deductions for these states.

State            Standard Deduction    Personal Exemption
                 (in dollars)          (in dollars)
Delaware         3250                  110
Hawaii           2000                  1040
Kentucky         2100
Minnesota        5450                  3500
North Dakota     5450                  3500
Oregon           1865                  169
Rhode Island     5450                  3500
Vermont          5450                  3500
Virginia         3000                  930

Source: TaxFoundation.org.

3.12 Refer to the data table in Exercise 3.11. Calculate the mean and median for the data on personal
exemptions for these states.

3.13 The following data give the 2007 gross domestic product (GDP) in billions of dollars for all 50 states.
The data are entered in alphabetic order by state (Bureau of Economic Analysis, June 2005).

166 247 1813 236 216 735 397 610 246 129
117 154 216 48 269 352 382 255 229 127
57 465 1103 399 466 139 158 531 47 153
244 1142 106 383 311 232 32

a. Calculate the mean and median for these data. Are these values of the mean and the median sam-
ple statistics or population parameters? Explain.

b. Do these data have a mode? Explain.


3.14 The following data give the 2008 profits (in millions of dollars) of the six Arizona-based companies
for the year 2008 (Fortune, May 5, 2008). The data represent the following companies, respectively:
Freeport-McMoRan Copper & Gold, Avnet, US Airways Group, Allied Waste Industries, Insight Enter-
prises, and PetSmart.

2977.0 393.1 427.0 273.6 77.8 258.7

Find the mean and median for these data. Do these data have a mode?

3.15 The following data give the 2006-07 team salaries for 20 teams of the English Premier League,
arguably the best-known soccer league in the world. The salaries are given in the order in which the teams
finished during the 2006-07 season. The salaries are in millions of British pounds (note that the approx-
imate value of 1 British pound was $1.95 during the 2006-07 season, so the team salaries range from
$34.3 million to $259 million). (Source: BBC, May 28, 2008.)

92.3 132.8 77.6 89.7 43.8 38.4 30.7 29.8 36.9 36.7
43.2 38.3 62.5 36.4 44.2 35.2 27.5 22.4 34.3 17.6

Find the mean and median for these data.

3.16 The following data give the numbers of car thefts that occurred in a city during the past 12 days.

6 3 7 11 4 3 8 7 2 6 9 15

Find the mean, median, and mode.

3.17 The following data give the revenues (in millions of dollars) for the last available fiscal year for a
sample of six charitable organizations for serious diseases (Charity Navigator, 2009). The values are, listed
in order, for the Alzheimer's Association, the American Cancer Society, the American Diabetes Associa-
tion, the American Heart Association, the American Lung Association, and the Cystic Fibrosis Foundation.

952 1129 231 668 49 149

Compute the mean and median. Do these data have a mode? Why or why not?

3.18 The following table gives the number of major penalties for each of the 15 teams in the Eastern Con-
ference of the National Hockey League during the 2008-09 season (NHL, 2009). A major penalty is sub-
ject to 5 minutes in the penalty box for a player.

Team                  Number of Major Penalties
Philadelphia
Columbus
Boston
Pittsburgh
New York Rangers
Tampa Bay
Nashville
Florida
Ottawa
Washington
Montreal
Atlanta
New York Islanders
Buffalo
Toronto

Compute the mean and median for the data on major penalties. Do these data have a mode? Why or why not?

3.19 Due to antiquated equipment and frequent windstorms, the town of Oak City often suffers power
outages. The following data give the numbers of power outages for each of the past 12 months.

4 5 7 3 2 0 2 3 2 1 2 4

Compute the mean, median, and mode for these data.


3.20 A brochure from the department of public safety in a northern state recommends that motorists should
carry 12 items (flashlights, blankets, and so forth) in their vehicles for emergency use while driving in
winter. The following data give the number of items out of these 12 that were carried in their vehicles by
15 randomly selected motorists.

5 3 7 8 0 1 0 5 12 10 7 6 7 11 9

Find the mean, median, and mode for these data. Are the values of these summary measures population
parameters or sample statistics? Explain.

3.21 Nixon Corporation manufactures computer monitors. The following data are the numbers of com-
puter monitors produced at the company for a sample of 10 days.

24 32 27 23 35 33 29 40 23 28
Calculate the mean, median, and mode for these data.

3.22 The Tri-City School District has instituted a zero-tolerance policy for students carrying any objects
that could be used as weapons. The following data give the number of students suspended during each of
the past 12 weeks for violating this school policy.

15 9 12 11 7 6 9 10 14 3 6 5

Calculate the mean, median, and mode for these data.

3.23 The following data represent the numbers of tornadoes that touched down during 1950 to 1994 in
the 12 states that had the most tornadoes during this period (Storm Prediction Center, 2009). The data for
these states are given in the following order: CO, FL, IA, IL, KS, LA, MO, MS, NE, OK, SD, TX.

1113 2009 1374 1137 2110 1086 1166 1039 1673 2300 1139 5490

a. Calculate the mean and median for these data.

b. Identify the outlier in this data set. Drop the outlier and recalculate the mean and median. Which
of these two summary measures changes by a larger amount when you drop the outlier?

c. Which is the better summary measure for these data, the mean or the median? Explain.

3.24 The following data set lists the number of women from each of 10 different countries who were on
the Rolex Women's World Golf Rankings Top 25 list as of March 31, 2009. The data, entered in that or-
der, are for the following countries: Australia, Brazil, England, Japan, Korea, Mexico, Norway, Sweden,
Taiwan, and United States.

2 1 1 2 9 1 1 2 2 4

a. Calculate the mean and median for these data.

b. Identify the outlier in this data set. Drop the outlier and recalculate the mean and median. Which
of these two summary measures changes by a larger amount when you drop the outlier?

c. Which is the better summary measure for these data, the mean or the median? Explain.

*3.25 One property of the mean is that if we know the means and sample sizes of two (or more) data sets,
we can calculate the combined mean of both (or all) data sets. The combined mean for two data sets is
calculated by using the formula

\[ \text{Combined mean} = \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]

where \(n_1\) and \(n_2\) are the sample sizes of the two data sets and \(\bar{x}_1\) and \(\bar{x}_2\) are the means of the two
data sets, respectively. Suppose a sample of 10 statistics books gave a mean price of $140 and a sample of
8 mathematics books gave a mean price of $160. Find the combined mean. (Hint: For this example,
\(n_1 = 10\), \(n_2 = 8\), \(\bar{x}_1 = \$140\), \(\bar{x}_2 = \$160\).)

*3.26 Twenty business majors and 18 economics majors go bowling. Each student bowls one game. The
scorekeeper announces that the mean score for the 18 economics majors is 144 and the mean score for
the entire group of 38 students is 150. Find the mean score for the 20 business majors.

*3.27 For any data, the sum of all values is equal to the product of the sample size and mean; that is,
\(\sum x = n\bar{x}\). Suppose the average amount of money spent on shopping by 10 persons during a given week
is $105.50. Find the total amount of money spent on shopping by these 10 persons.

*3.28 The mean 2009 income for five families was $99,520. What was the total 2009 income of these
five families?

*3.29 The mean age of six persons is 46 years. The ages of five of these six persons are 57, 39, 44, 51,
and 37 years, respectively. Find the age of the sixth person.


*3.30 Seven airline passengers in economy class on the same flight paid an average of $361 per ticket.
Because the tickets were purchased at different times and from different sources, the prices varied. The
first five passengers paid $420, $210, $333, $695, and $485. The sixth and seventh tickets were purchased
by a couple who paid identical fares. What price did each of them pay?

*3.31 Consider the following two data sets.

Data Set I: 12 25 37 8 41

Data Set II: 19 32 44 15 48

Notice that each value of the second data set is obtained by adding 7 to the corresponding value of the
first data set. Calculate the mean for each of these two data sets. Comment on the relationship between
the two means.

*3.32 Consider the following two data sets.

Data Set I: 4 8 15 9 11

Data Set II: 8 16 30 18 22

Notice that each value of the second data set is obtained by multiplying the corresponding value of the
first data set by 2. Calculate the mean for each of these two data sets. Comment on the relationship be-
tween the two means.

*3.33 The trimmed mean is calculated by dropping a certain percentage of values from each end of a
ranked data set. The trimmed mean is especially useful as a measure of central tendency when a data set
contains a few outliers at each end. Suppose the following data give the ages (in years) of 10 employees
of a company:

47 53 38 26 39 49 19 67 31 23

To calculate the 10% trimmed mean, first rank these data values in increasing order; then drop 10% of the
smallest values and 10% of the largest values. The mean of the remaining 80% of the values will give the
10% trimmed mean. Note that this data set contains 10 values, and 10% of 10 is 1. Thus, if we drop
the smallest value and the largest value from this data set, the mean of the remaining 8 values will be
called the 10% trimmed mean. Calculate the 10% trimmed mean for this data set.

*3.34 The following data give the prices (in thousands of dollars) of 20 houses sold recently in a city.

184 297 365 309 245 387 369 438 195 390

323 578 410 679 307 271 457 795 259 590

Find the 20% trimmed mean for this data set.

*3.35 In some applications, certain values in a data set may be considered more important than others.
For example, to determine students' grades in a course, an instructor may assign a weight to the final exam
that is twice as much as that to each of the other exams. In such cases, it is more appropriate to use the
weighted mean. In general, for a sequence of n data values \(x_1, x_2, \ldots, x_n\) that are assigned weights
\(w_1, w_2, \ldots, w_n\), respectively, the weighted mean is found by the formula

\[ \text{Weighted mean} = \frac{\sum xw}{\sum w} \]

where \(\sum xw\) is obtained by multiplying each data value by its weight and then adding the products.
Suppose an instructor gives two exams and a final, assigning the final exam a weight twice that of each of the
other exams. Find the weighted mean for a student who scores 73 and 67 on the first two exams and 85
on the final. (Hint: Here, \(x_1 = 73\), \(x_2 = 67\), \(x_3 = 85\), \(w_1 = w_2 = 1\), and \(w_3 = 2\).)

*3.36 When studying phenomena such as inflation or population changes that involve periodic increases
or decreases, the geometric mean is used to find the average change over the entire period under study.
To calculate the geometric mean of a sequence of n values \(x_1, x_2, \ldots, x_n\), we multiply them together and
then find the nth root of this product. Thus

\[ \text{Geometric mean} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} \]

Suppose that the inflation rates for the last five years are 4%, 3%, 5%, 6%, and 8%, respectively. Thus at
the end of the first year, the price index will be 1.04 times the price index at the beginning of the year,
and so on. Find the mean rate of inflation over the 5-year period by finding the geometric mean of the
data set 1.04, 1.03, 1.05, 1.06, and 1.08. (Hint: Here, \(n = 5\), \(x_1 = 1.04\), \(x_2 = 1.03\), and so on. Use the
\(x^{1/n}\) key on your calculator to find the fifth root. Note that the mean inflation rate will be obtained by
subtracting 1 from the geometric mean.)


3.2 Measures of Dispersion for Ungrouped Data

The measures of central tendency, such as the mean, median, and mode, do not reveal the whole
picture of the distribution of a data set. Two data sets with the same mean may have completely
different spreads. The variation among the values of observations for one data set may be much
larger or smaller than for the other data set. (Note that the words dispersion, spread, and vari-
ation have the same meaning.) Consider the following two data sets on the ages (in years) of
all workers working for each of two small companies.

Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27

The mean age of workers in both these companies is the same, 40 years. If we do not know
the ages of individual workers at these two companies and are told only that the mean age of
the workers at both companies is the same, we may deduce that the workers at these two com-
panies have a similar age distribution. As we can observe, however, the variation in the workers'
ages for each of these two companies is very different. As illustrated in the diagram, the ages
of the workers at the second company have a much larger variation than the ages of the workers
at the first company.

Company 1:  35 36 38 39 40 45 47
Company 2:  18 27 33 52 70

[Diagram: the ages at each company marked on a number line.]

Thus, the mean, median, or mode by itself is usually not a sufficient measure to reveal the
shape of the distribution of a data set. We also need a measure that can provide some informa-
tion about the variation among data values. The measures that help us learn about the spread of
a data set are called the measures of dispersion. The measures of central tendency and disper-
sion taken together give a better picture of a data set than the measures of central tendency alone.
This section discusses three measures of dispersion: range, variance, and standard deviation.
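
The comparison of the two companies can be reproduced with a short script. This is a sketch only: it uses Python's `statistics` module, and it treats each list as population data because the text says the ages are for all workers at each company.

```python
from statistics import mean, pstdev

company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for label, ages in (("Company 1", company_1), ("Company 2", company_2)):
    spread = max(ages) - min(ages)    # range: largest value minus smallest value
    sd = pstdev(ages)                 # population standard deviation
    print(label, "mean =", mean(ages), "range =", spread, "std dev =", round(sd, 2))
```

Both means come out as 40, while the range and the standard deviation are several times larger for the second company, which is the information the mean alone hides.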

3.2.1 Range

The range is the simplest measure of dispersion to calculate. It is obtained by taking the dif-
ference between the largest and the smallest values in a data set.

Finding the Range for Ungrouped Data

Range = Largest value — Smallest value

Calculating the range
for ungrouped data.

■ EXAMPLE 3-11

Table 3.4 gives the total areas in square miles of the four western South-Central states of the
United States.

Table 3.4

State        Total Area (square miles)
Arkansas     53,182
Louisiana    49,651
Oklahoma     69,903
Texas        267,277

Find the range for this data set.


Solution The maximum total area for a state in this data set is 267,277 square miles, and
the smallest area is 49,651 square miles. Therefore,

Range = Largest value — Smallest value

= 267,277 - 49,651 = 217,626 square miles

Thus, the total areas of these four states are spread over a range of 217,626 square miles. ■

The range, like the mean, has the disadvantage of being influenced by outliers. In Example
3-11, if the state of Texas with a total area of 267,277 square miles is dropped, the range de-
creases from 217,626 square miles to 20,252 square miles. Consequently, the range is not a
good measure of dispersion to use for a data set that contains outliers.

Another disadvantage of using the range as a measure of dispersion is that its calculation is
based on two values only: the largest and the smallest. All other values in a data set are ignored
when calculating the range. Thus, the range is not a very satisfactory measure of dispersion.
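
The effect of the outlier on the range in Example 3-11 can be verified directly. The sketch below uses the areas from Table 3.4; the helper name `value_range` is hypothetical.

```python
# Total areas (square miles) from Table 3.4
areas = {"Arkansas": 53182, "Louisiana": 49651, "Oklahoma": 69903, "Texas": 267277}


def value_range(values):
    """Range = largest value - smallest value."""
    values = list(values)
    return max(values) - min(values)


range_all = value_range(areas.values())
range_without_texas = value_range(
    area for state, area in areas.items() if state != "Texas"
)
print(range_all, range_without_texas)
```

Only the two extreme values enter the calculation, so removing a single state changes the result by more than a factor of ten.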

3.2.2 Variance and Standard Deviation

The standard deviation is the most-used measure of dispersion. The value of the standard de-
viation tells how closely the values of a data set are clustered around the mean. In general, a
lower value of the standard deviation for a data set indicates that the values of that data set are
spread over a relatively smaller range around the mean. In contrast, a larger value of the stan-
dard deviation for a data set indicates that the values of that data set are spread over a relatively
larger range around the mean.

The standard deviation is obtained by taking the positive square root of the variance. The variance
calculated for population data is denoted by \(\sigma^2\) (read as sigma squared), 2 and the variance
calculated for sample data is denoted by \(s^2\). Consequently, the standard deviation calculated for
population data is denoted by \(\sigma\), and the standard deviation calculated for sample data is denoted by
\(s\). Following are what we will call the basic formulas that are used to calculate the variance: 3

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

where \(\sigma^2\) is the population variance and \(s^2\) is the sample variance.

The quantity \(x - \mu\) or \(x - \bar{x}\) in the above formulas is called the deviation of the x value
from the mean. The sum of the deviations of the x values from the mean is always zero; that
is, \(\sum (x - \mu) = 0\) and \(\sum (x - \bar{x}) = 0\).

For example, suppose the midterm scores of a sample of four students are 82, 95, 67, and
92, respectively. Then, the mean score for these four students is

\[ \bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84 \]

The deviations of the four scores from the mean are calculated in Table 3.5. As we can observe from
the table, the sum of the deviations of the x values from the mean is zero; that is, \(\sum (x - \bar{x}) = 0\).
For this reason we square the deviations to calculate the variance and standard deviation.

Table 3.5

x      \(x - \bar{x}\)
82     82 - 84 = -2
95     95 - 84 = +11
67     67 - 84 = -17
92     92 - 84 = +8
       \(\sum (x - \bar{x}) = 0\)
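
The zero-sum property of the deviations, and the reason for squaring them, can be illustrated with the same four midterm scores. This is a sketch under the assumption that the scores are sample data, as the text states; the last step, applying the basic formula for the sample variance to these scores, goes one step beyond what the text computes.

```python
scores = [82, 95, 67, 92]
n = len(scores)
x_bar = sum(scores) / n                               # sample mean

deviations = [x - x_bar for x in scores]              # x - x_bar for each score
sum_of_deviations = sum(deviations)                   # positive and negative parts cancel

squared_deviations = [d ** 2 for d in deviations]     # squaring removes the signs
sample_variance = sum(squared_deviations) / (n - 1)   # basic formula for the sample variance

print(deviations, sum_of_deviations)
print(round(sample_variance, 2))
```

Because the raw deviations always cancel, their sum carries no information about spread; the sum of the squared deviations (478 here) does.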

2 Note that \(\Sigma\) is uppercase sigma and \(\sigma\) is lowercase sigma of the Greek alphabet.

3 From the formula for \(\sigma^2\), it can be stated that the population variance is the mean of the squared deviations of x values
from the mean. However, this is not true for the variance calculated for a sample data set.


From the computational point of view, it is easier and more efficient to use short-cut for-
mulas to calculate the variance and standard deviation. By using the short-cut formulas, we
reduce the computation time and round-off errors. Use of the basic formulas for ungrouped data
is illustrated in Section A3.1.1 of Appendix 3.1 of this chapter. The short-cut formulas for
calculating the variance and standard deviation are given next.

Short-Cut Formulas for the Variance and Standard Deviation for Ungrouped Data

\[ \sigma^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1} \]

where \(\sigma^2\) is the population variance and \(s^2\) is the sample variance.

The standard deviation is obtained by taking the positive square root of the variance.

Population standard deviation: \(\sigma = \sqrt{\sigma^2}\)
Sample standard deviation: \(s = \sqrt{s^2}\)

Note that the denominator in the formula for the population variance is \(N\), but that in the
formula for the sample variance it is \(n - 1\). 4

Calculating the sample variance
and standard deviation
for ungrouped data.

■ EXAMPLE 3-12

The following table gives the 2008 market values (rounded to billions of dollars) of five
international companies.

Company              Market Value (billions of dollars)
PepsiCo              75
Google               107
PetroChina           271
Johnson & Johnson    138
Intel                71

Find the variance and standard deviation for these data.

Solution Let x denote the 2008 market value (in billions of dollars) of a company. The values
of \(\sum x\) and \(\sum x^2\) are calculated in Table 3.6.

Table 3.6

x        \(x^2\)
75       5625
107      11,449
271      73,441
138      19,044
71       5041
\(\sum x = 662\)     \(\sum x^2 = 114{,}600\)

Calculation of the variance involves the following four steps.

4 The reason that the denominator in the sample formula is \(n - 1\) and not \(n\) is as follows: The sample variance
underestimates the population variance when the denominator in the sample formula for variance is \(n\). However, the sample
variance does not underestimate the population variance if the denominator in the sample formula for variance is
\(n - 1\). In Chapter 8 we will learn that \(n - 1\) is called the degrees of freedom.


Step 1. Calculate \(\sum x\).

The sum of the values in the first column of Table 3.6 gives the value of \(\sum x\), which
is 662.

Step 2. Find \(\sum x^2\).

The value of \(\sum x^2\) is obtained by squaring each value of x and then adding the squared
values. The results of this step are shown in the second column of Table 3.6. Notice that
\(\sum x^2 = 114{,}600\).

Step 3. Determine the variance.

Substitute all the values in the variance formula and simplify. Because the given data
are on the market values of only five companies, we use the formula for the sample
variance.

\[ s^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1} = \frac{114{,}600 - \dfrac{(662)^2}{5}}{5 - 1} = \frac{114{,}600 - 87{,}648.80}{4} = 6737.80 \]

Step 4. Obtain the standard deviation.

The standard deviation is obtained by taking the (positive) square root of the variance.

\[ s = \sqrt{6737.80} = 82.0841 = \$82.08 \text{ billion} \]

Thus, the standard deviation of the market values of these five companies is $82.08
billion.
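
The four steps of Example 3-12 translate almost line for line into code. The sketch below applies the short-cut formula for the sample variance to the five market values; the value 71 for Intel is the one implied by \(x^2 = 5041\) in Table 3.6, and the variable names are hypothetical.

```python
import math

# 2008 market values (billions of dollars) from Example 3-12
market_values = [75, 107, 271, 138, 71]
n = len(market_values)

sum_x = sum(market_values)                         # Step 1: sum of x
sum_x_sq = sum(x ** 2 for x in market_values)      # Step 2: sum of squared x values

# Step 3: short-cut formula for the sample variance (denominator n - 1)
sample_variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

# Step 4: positive square root of the variance
sample_sd = math.sqrt(sample_variance)

print(sum_x, sum_x_sq)
print(round(sample_variance, 2), round(sample_sd, 2))
```

The short-cut form needs only the two running totals, which is why it suits hand calculation; for data with a very large mean relative to its spread, the basic formula is usually the numerically safer choice in floating-point arithmetic.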

Two Observations

1. The values of the variance and the standard deviation are never negative. That is,
the numerator in the formula for the variance should never produce a negative value.

Usually the values of the variance and standard deviation are positive, but if a data set
has no variation, then the variance and standard deviation are both zero. For example,
if four persons in a group are the same age — say, 35 years — then the four values in the
data set are

35 35 35 35

If we calculate the variance and standard deviation for these data, their values are zero. This
is because there is no variation in the values of this data set.

2. The measurement units of variance are always the square of the measurement units
of the original data. This is so because the original values are squared to calculate the
variance. In Example 3-12, the measurement units of the original data are billions of dol-
lars. However, the measurement units of the variance are squared billions of dollars,
which, of course, does not make any sense. Thus, the variance of the 2008 market val-
ues of these five companies in Example 3-12 is 6737.80 squared billion dollars. But the
measurement units of the standard deviation are the same as the measurement units of
the original data because the standard deviation is obtained by taking the square root of
the variance.

EXAMPLE 3-13

Following are the 2009 earnings (in thousands of dollars) before taxes for all six employees
of a small company.

88.50 108.40 65.50 52.50 79.80 54.60

Calculate the variance and standard deviation for these data.

Calculating the population
variance and standard deviation
for ungrouped data.

Solution Let x denote the 2009 earnings before taxes of an employee of this company. The
values of Σx and Σx² are calculated in Table 3.7.


Table 3.7

x                 x²
88.50         7832.25
108.40      11,750.56
65.50         4290.25
52.50         2756.25
79.80         6368.04
54.60         2981.16
Σx = 449.30   Σx² = 35,978.51

Because the data in this example are on earnings of all employees of this company, we use
the population formula to compute the variance. Thus, the variance is

\[
\sigma^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N}
         = \frac{35,978.51 - \dfrac{(449.30)^2}{6}}{6}
         = 388.90
\]

The standard deviation is obtained by taking the (positive) square root of the variance:

\[
\sigma = \sqrt{388.90} = 19.721 \text{ thousand} = \$19,721
\]

Thus, the standard deviation of the 2009 earnings of all six employees of this company is
$19,721. ■
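
The calculation in Example 3-13 can be reproduced in a few lines of Python. This is a sketch under our own naming (earnings, sum_x, sum_x_sq); it applies the short-cut population formula and then cross-checks the result against the standard library.

```python
import math
import statistics

# 2009 earnings (thousands of dollars) of all six employees, from Example 3-13.
earnings = [88.50, 108.40, 65.50, 52.50, 79.80, 54.60]

N = len(earnings)
sum_x = sum(earnings)                      # Σx
sum_x_sq = sum(x ** 2 for x in earnings)   # Σx²

# Short-cut population formula: σ² = (Σx² − (Σx)²/N) / N
pop_variance = (sum_x_sq - sum_x ** 2 / N) / N
pop_sd = math.sqrt(pop_variance)

print(round(sum_x, 2), round(sum_x_sq, 2))
print(round(pop_variance, 2), round(pop_sd, 3))

# Cross-check against the library's population variance.
library_variance = statistics.pvariance(earnings)
```

The two routes should agree up to floating-point rounding; the short-cut form is convenient by hand, while the library call works from deviations about the mean.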

Warning ► Note that Σx² is not the same as (Σx)². The value of Σx² is obtained by squaring the x values
and then adding them. The value of (Σx)² is obtained by squaring the value of Σx.
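
The difference in the warning is easy to see on a tiny data set. The values below are hypothetical and chosen only for illustration.

```python
# Hypothetical data, not taken from the text.
x = [1, 2, 3]

sum_of_squares = sum(v ** 2 for v in x)   # Σx²: square each value, then add
square_of_sum = sum(x) ** 2               # (Σx)²: add the values, then square

print(sum_of_squares)   # 1 + 4 + 9
print(square_of_sum)    # 6 squared
```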

The uses of the standard deviation are discussed in Section 3.4. Later chapters explain how
the mean and the standard deviation taken together can help in making inferences about the
population.

3.2.3 Population Parameters and Sample Statistics

A numerical measure such as the mean, median, mode, range, variance, or standard deviation
calculated for a population data set is called a population parameter, or simply a parameter.
A summary measure calculated for a sample data set is called a sample statistic, or simply a
statistic. Thus, μ and σ are population parameters, and x̄ and s are sample statistics. As an
illustration, x̄ = $228 billion in Example 3-1 is a sample statistic, and μ = 45.25 years in
Example 3-2 is a population parameter. Similarly, s = $82.08 billion in Example 3-12 is a
sample statistic, whereas σ = $19,721 in Example 3-13 is a population parameter.
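
In software the same distinction usually shows up as two separate functions. As a sketch, Python's statistics module offers pstdev (divide by N) for a population and stdev (divide by n − 1) for a sample. The data are those of Example 3-13; treating them as a sample in the second call is purely hypothetical, done only to show that the two formulas give different answers.

```python
import statistics

# Earnings (thousands of dollars) from Example 3-13.
earnings = [88.50, 108.40, 65.50, 52.50, 79.80, 54.60]

sigma = statistics.pstdev(earnings)   # population parameter: divisor N
s = statistics.stdev(earnings)        # sample statistic: divisor n - 1 (hypothetical use here)

print(round(sigma, 3), round(s, 3))
```

Because the sample formula divides by a smaller number, s is always at least as large as σ for the same values.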

EXERCISES

CONCEPTS AND PROCEDURES

3.37 The range, as a measure of spread, has the disadvantage of being influenced by outliers. Illustrate
this with an example.

3.38 Can the standard deviation have a negative value? Explain.

3.39 When is the value of the standard deviation for a data set zero? Give one example. Calculate the
standard deviation for the example and show that its value is zero.

3.40 Briefly explain the difference between a population parameter and a sample statistic. Give one ex-
ample of each.

3.41 The following data set belongs to a population:

5 -7 2 -9 1 61 07
Calculate the range, variance, and standard deviation.


3.42 The following data set belongs to a sample:

14 18 -1 08 8 -16
Calculate the range, variance, and standard deviation.

■ APPLICATIONS

3.43 The following data give the number of shoplifters apprehended during each of the past 8 weeks at a
large department store.

7 10 8 3 15 12 6 11

a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the
sum of these deviations zero?

b. Calculate the range, variance, and standard deviation.

3.44 The following data give the prices of seven textbooks randomly selected from a university bookstore.
$89 $170 $104 $113 $56 $161 $147

a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the
sum of these deviations zero?

b. Calculate the range, variance, and standard deviation.

3.45 The following data give the numbers of car thefts that occurred in a city in the past 12 days.
6 3 7 11 4 3 8 7 2 6 9 15

Calculate the range, variance, and standard deviation.

3.46 Refer to the data in Exercise 3.23, which contained the numbers of tornadoes that touched down in
12 states that had the most tornadoes during the period 1950 to 1994. The data are reproduced here.

1113 2009 1374 1137 2110 1086 1166 1039 1673 2300 1139 5490

Find the variance, standard deviation, and range for these data.

3.47 The following data give the numbers of pieces of junk mail received by 10 families during the past
month.

41 33 28 21 29 19 14 31 39 36
Find the range, variance, and standard deviation.

3.48 The following data give the number of highway collisions with large wild animals, such as deer or
moose, in one of the northeastern states during each week of a 9-week period.

7 10 3 8 2 5 7 4 9

Find the range, variance, and standard deviation.

3.49 Attacks by stinging insects, such as bees or wasps, may become medical emergencies if either the
victim is allergic to venom or multiple stings are involved. The following data give the number of patients
treated each week for such stings in a large regional hospital during 13 weeks last summer.

1 5 2 3 0 4 1 7 0 1 2 0 1

Compute the range, variance, and standard deviation for these data.

3.50 The following data give the number of hot dogs consumed by 10 participants in a hot-dog-eating
contest.

21 17 32 8 20 15 17 23 9 18

Calculate the range, variance, and standard deviation for these data.

3.51 Following are the temperatures (in degrees Fahrenheit) observed during eight wintry days in a mid-
western city:

23 14 6 -7 -2 11 16 19

Compute the range, variance, and standard deviation.

3.52 The following data give the numbers of hours spent partying by 10 randomly selected college stu-
dents during the past week.

7 1 45 9 7 1 04 8

Compute the range, variance, and standard deviation.


3.53 The following data represent the total points scored in each of the NFL championship games played
from 2000 through 2009 in that order.

39 41 37 69 61 45 31 46 31 50

Compute the variance, standard deviation, and range for these data.

3.54 The following data represent the 2006 guaranteed annual salaries (in thousands of dollars) of the head
coaches of the final eight teams in the 2006 NCAA Men's Basketball Championship. The data are given in
the following order: Connecticut, Florida, George Mason, LSU, Memphis, Texas, UCLA, and Villanova.

1500 1389 489 900 1315 1800 1150 584

Compute the variance, standard deviation, and range for these data.

3.55 The following data give the hourly wage rates of eight employees of a company.
$22 22 22 22 22 22 22 22

Calculate the standard deviation. Is its value zero? If yes, why?

3.56 The following data are the ages (in years) of six students.

19 19 19 19 19 19

Calculate the standard deviation. Is its value zero? If yes, why?

*3.57 One disadvantage of the standard deviation as a measure of dispersion is that it is a measure of ab-
solute variability and not of relative variability. Sometimes we may need to compare the variability of two
different data sets that have different units of measurement. The coefficient of variation is one such meas-
ure. The coefficient of variation, denoted by CV, expresses standard deviation as a percentage of the mean
and is computed as follows:

For population data: \( \mathrm{CV} = \dfrac{\sigma}{\mu} \times 100\% \)

For sample data: \( \mathrm{CV} = \dfrac{s}{\bar{x}} \times 100\% \)

The yearly salaries of all employees who work for a company have a mean of $62,350 and a standard de-
viation of $6820. The years of experience for the same employees have a mean of 15 years and a stan-
dard deviation of 2 years. Is the relative variation in the salaries larger or smaller than that in years of ex-
perience for these employees?

*3.58 The SAT scores of 100 students have a mean of 975 and a standard deviation of 105. The GPAs of
the same 100 students have a mean of 3.16 and a standard deviation of .22. Is the relative variation in SAT
scores larger or smaller than that in GPAs?

*3.59 Consider the following two data sets.

Data Set I: 12 25 37 8 41

Data Set II: 19 32 44 15 48

Note that each value of the second data set is obtained by adding 7 to the corresponding value of the first
data set. Calculate the standard deviation for each of these two data sets using the formula for sample data.
Comment on the relationship between the two standard deviations.

*3.60 Consider the following two data sets.

Data Set I: 4 8 15 9 11

Data Set II: 8 16 30 18 22

Note that each value of the second data set is obtained by multiplying the corresponding value of the first
data set by 2. Calculate the standard deviation for each of these two data sets using the formula for pop-
ulation data. Comment on the relationship between the two standard deviations.

3.3 Mean, Variance, and Standard Deviation
for Grouped Data

In Sections 3.1.1 and 3.2.2, we learned how to calculate the mean, variance, and standard de-
viation for ungrouped data. In this section, we will learn how to calculate the mean, variance,
and standard deviation for grouped data.


3.3.1 Mean for Grouped Data

We learned in Section 3.1.1 that the mean is obtained by dividing the sum of all values by the
number of values in a data set. However, if the data are given in the form of a frequency table,
we no longer know the values of individual observations. Consequently, in such cases, we can-
not obtain the sum of individual values. We find an approximation for the sum of these values
using the procedure explained in the next paragraph and example. The formulas used to calcu-
late the mean for grouped data follow.

Calculating Mean for Grouped Data

Mean for population data: \( \mu = \dfrac{\sum mf}{N} \)

Mean for sample data: \( \bar{x} = \dfrac{\sum mf}{n} \)

where m is the midpoint and f is the frequency of a class.

To calculate the mean for grouped data, first find the midpoint of each class and then
multiply the midpoints by the frequencies of the corresponding classes. The sum of these products,
denoted by Σmf, gives an approximation for the sum of all values. To find the value of the mean,
divide this sum by the total number of observations in the data.

■ EXAMPLE 3-14

Table 3.8 gives the frequency distribution of the daily commuting times (in minutes) from
home to work for all 25 employees of a company.

Table 3.8

Daily Commuting Time     Number of
(minutes)                Employees
0 to less than 10            4
10 to less than 20           9
20 to less than 30           6
30 to less than 40           4
40 to less than 50           2

Calculate the mean of the daily commuting times.

Solution Note that because the data set includes all 25 employees of the company, it
represents the population. Table 3.9 shows the calculation of Σmf. Note that in Table 3.9, m
denotes the midpoints of the classes.

Table 3.9

Daily Commuting Time
(minutes)                 f      m      mf
0 to less than 10         4      5      20
10 to less than 20        9     15     135
20 to less than 30        6     25     150
30 to less than 40        4     35     140
40 to less than 50        2     45      90
                       N = 25        Σmf = 535

Calculating the population
mean for grouped data.


To calculate the mean, we first find the midpoint of each class. The class midpoints are
recorded in the third column of Table 3.9. The products of the midpoints and the corresponding
frequencies are listed in the fourth column. The sum of the fourth column values, denoted
by Σmf, gives the approximate total daily commuting time (in minutes) for all 25 employees.
The mean is obtained by dividing this sum by the total frequency. Therefore,

\[
\mu = \frac{\sum mf}{N} = \frac{535}{25} = 21.40 \text{ minutes}
\]

Thus, the employees of this company spend an average of 21.40 minutes a day commuting
from home to work.

What do the numbers 20, 135, 150, 140, and 90 in the column labeled mf in Table 3.9
represent? We know from this table that 4 employees spend 0 to less than 10 minutes commuting
per day. If we assume that the time spent commuting by these 4 employees is evenly spread in
the interval 0 to less than 10, then the midpoint of this class (which is 5) gives the mean time
spent commuting by these 4 employees. Hence, 4 × 5 = 20 is the approximate total time (in
minutes) spent commuting per day by these 4 employees. Similarly, 9 employees spend 10 to
less than 20 minutes commuting per day, and the total time spent commuting by these 9 em-
ployees is approximately 135 minutes a day. The other numbers in this column can be inter-
preted in the same way. Note that these numbers give the approximate commuting times for
these employees based on the assumption of an even spread within classes. The total commut-
ing time for all 25 employees is approximately 535 minutes. Consequently, 21.40 minutes is an
approximate and not the exact value of the mean. We can find the exact value of the mean only
if we know the exact commuting time for each of the 25 employees of the company.
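
The midpoint method of Example 3-14 can be sketched in Python as follows. The list name classes and the tuple layout (lower limit, upper limit, frequency) are our own choices; the numbers are those of Table 3.8.

```python
# Classes of Table 3.8 as (lower limit, upper limit, frequency f).
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

N = sum(f for _, _, f in classes)

# Each product m * f approximates the total commuting time of one class,
# assuming values are spread evenly within the class.
products = [((low + high) / 2) * f for low, high, f in classes]
sum_mf = sum(products)

mean = sum_mf / N
print(products)
print(sum_mf, mean)
```

The result is only as good as the even-spread assumption; with the 25 individual commuting times the exact mean could differ.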

Calculating the sample mean
for grouped data.

■ EXAMPLE 3-15

Table 3.10 gives the frequency distribution of the number of orders received each day during
the past 50 days at the office of a mail-order company.

Table 3.10

Number of Orders     Number of Days
10-12                      4
13-15                     12
16-18                     20
19-21                     14

Calculate the mean.

Solution Because the data set includes only 50 days, it represents a sample. The value of
Σmf is calculated in Table 3.11.

Table 3.11

Number of Orders      f      m      mf
10-12                 4     11      44
13-15                12     14     168
16-18                20     17     340
19-21                14     20     280
                   n = 50        Σmf = 832

The value of the sample mean is

\[
\bar{x} = \frac{\sum mf}{n} = \frac{832}{50} = 16.64 \text{ orders}
\]


Thus, this mail-order company received an average of 16.64 orders per day during these
50 days. ■

3.3.2 Variance and Standard Deviation for Grouped Data

Following are what we will call the basic formulas used to calculate the population and sample
variances for grouped data:

\[
\sigma^2 = \frac{\sum f(m - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1}
\]

where σ² is the population variance, s² is the sample variance, and m is the midpoint of a class.

In either case, the standard deviation is obtained by taking the positive square root of the
variance.

Again, the short-cut formulas are more efficient for calculating the variance and standard
deviation. Section A3.1.2 of Appendix 3.1 at the end of this chapter shows how to use the
basic formulas to calculate the variance and standard deviation for grouped data.

Short-Cut Formulas for the Variance and Standard Deviation for Grouped Data

\[
\sigma^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N} \quad \text{and} \quad
s^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1}
\]

where σ² is the population variance, s² is the sample variance, and m is the midpoint of a class.
The standard deviation is obtained by taking the positive square root of the variance.

Population standard deviation: \( \sigma = \sqrt{\sigma^2} \)
Sample standard deviation: \( s = \sqrt{s^2} \)
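
The basic and short-cut formulas give the same value because N = Σf and μ = Σmf/N. The following derivation for the population case is a sketch added for illustration; it is not reproduced from the appendix.

```latex
\begin{align*}
\sum f(m - \mu)^2
  &= \sum m^2 f - 2\mu \sum mf + \mu^2 \sum f \\
  &= \sum m^2 f - 2\,\frac{(\sum mf)^2}{N} + \frac{(\sum mf)^2}{N^2}\, N \\
  &= \sum m^2 f - \frac{(\sum mf)^2}{N}
\end{align*}
```

Dividing both sides by N gives the short-cut formula for σ². The sample case follows the same steps with x̄ and n in place of μ and N in the numerator, and n − 1 as the divisor.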

Examples 3-16 and 3-17 illustrate the use of these formulas to calculate the variance and
standard deviation.

EXAMPLE 3-16

The following data, reproduced from Table 3.8 of Example 3-14, give the frequency
distribution of the daily commuting times (in minutes) from home to work for all 25 employees of
a company.

Calculating the population
variance and standard deviation
for grouped data.

Daily Commuting Time
(minutes)                Number of Employees
0 to less than 10                4
10 to less than 20               9
20 to less than 30               6
30 to less than 40               4
40 to less than 50               2

Calculate the variance and standard deviation.

Solution All four steps needed to calculate the variance and standard deviation for grouped
data are shown after Table 3.12.

Table 3.12

Daily Commuting Time
(minutes)                 f      m      mf      m²f
0 to less than 10         4      5      20      100
10 to less than 20        9     15     135     2025
20 to less than 30        6     25     150     3750
30 to less than 40        4     35     140     4900
40 to less than 50        2     45      90     4050
                       N = 25      Σmf = 535   Σm²f = 14,825

Step 1. Calculate the value of Σmf.

To calculate the value of Σmf, first find the midpoint m of each class (see the third column
in Table 3.12) and then multiply the corresponding class midpoints and class frequencies
(see the fourth column). The value of Σmf is obtained by adding these products. Thus,

Σmf = 535

Step 2. Find the value of Σm²f.

To find the value of Σm²f, square each m value and multiply this squared value of m by
the corresponding frequency (see the fifth column in Table 3.12). The sum of these
products (that is, the sum of the fifth column) gives Σm²f. Hence,

Σm²f = 14,825

Step 3. Calculate the variance.

Because the data set includes all 25 employees of the company, it represents the popula-
tion. Therefore, we use the formula for the population variance:

\[
\sigma^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N}
         = \frac{14,825 - \dfrac{(535)^2}{25}}{25}
         = \frac{3376}{25}
         = 135.04
\]

Step 4. Calculate the standard deviation.

To obtain the standard deviation, take the (positive) square root of the variance.

\[
\sigma = \sqrt{\sigma^2} = \sqrt{135.04} = 11.62 \text{ minutes}
\]

Thus, the standard deviation of the daily commuting times for these employees is 11.62
minutes. ■
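
The four steps of Example 3-16 translate directly into code. The sketch below uses our own variable names and also evaluates the basic formula Σf(m − μ)²/N as a cross-check on the short-cut result.

```python
import math

# Midpoints m and frequencies f for the five commuting-time classes.
midpoints = [5, 15, 25, 35, 45]
freqs = [4, 9, 6, 4, 2]

N = sum(freqs)
sum_mf = sum(m * f for m, f in zip(midpoints, freqs))         # Step 1: Σmf
sum_m2f = sum(m ** 2 * f for m, f in zip(midpoints, freqs))   # Step 2: Σm²f

# Step 3: short-cut population variance.
pop_variance = (sum_m2f - sum_mf ** 2 / N) / N
# Step 4: standard deviation.
pop_sd = math.sqrt(pop_variance)

# Cross-check with the basic formula Σf(m − μ)² / N.
mu = sum_mf / N
basic_variance = sum(f * (m - mu) ** 2 for m, f in zip(midpoints, freqs)) / N

print(sum_mf, sum_m2f)
print(round(pop_variance, 2), round(pop_sd, 2))
```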

Note that the values of the variance and standard deviation calculated in Example 3-16 for
grouped data are approximations. The exact values of the variance and standard deviation can be
obtained only by using the ungrouped data on the daily commuting times of the 25 employees.

Calculating the sample
variance and standard
deviation for grouped data.

■ EXAMPLE 3-17

The following data, reproduced from Table 3.10 of Example 3-15, give the frequency distri-
bution of the number of orders received each day during the past 50 days at the office of a
mail-order company.

Number of Orders      f
10-12                 4
13-15                12
16-18                20
19-21                14

Calculate the variance and standard deviation.


Solution All the information required for the calculation of the variance and standard de-
viation appears in Table 3.13.

Table 3.13

Number of Orders      f      m      mf      m²f
10-12                 4     11      44      484
13-15                12     14     168     2352
16-18                20     17     340     5780
19-21                14     20     280     5600
                   n = 50      Σmf = 832   Σm²f = 14,216

Because the data set includes only 50 days, it represents a sample. Hence, we use the sam-
ple formulas to calculate the variance and standard deviation. By substituting the values into
the formula for the sample variance, we obtain

\[
s^2 = \frac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1}
    = \frac{14,216 - \dfrac{(832)^2}{50}}{50 - 1}
    = 7.5820
\]

Hence, the standard deviation is

\[
s = \sqrt{s^2} = \sqrt{7.5820} = 2.75 \text{ orders}
\]

Thus, the standard deviation of the number of orders received at the office of this mail-order
company during the past 50 days is 2.75.
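
For sample data only the divisor changes. The helper below, grouped_sample_stats, is a hypothetical name of our own. The midpoints 11, 14, 17, and 20 come from the class limits, and the frequencies 4, 12, 20, and 14 are the ones consistent with the totals n = 50, Σmf = 832, and Σm²f = 14,216 in Example 3-17.

```python
import math

def grouped_sample_stats(midpoints, freqs):
    """Return (mean, variance, sd) for grouped sample data via the short-cut formula."""
    n = sum(freqs)
    sum_mf = sum(m * f for m, f in zip(midpoints, freqs))
    sum_m2f = sum(m * m * f for m, f in zip(midpoints, freqs))
    variance = (sum_m2f - sum_mf ** 2 / n) / (n - 1)   # divisor n - 1 for a sample
    return sum_mf / n, variance, math.sqrt(variance)

midpoints = [11, 14, 17, 20]
freqs = [4, 12, 20, 14]

mean, variance, sd = grouped_sample_stats(midpoints, freqs)
print(round(mean, 2), round(variance, 4), round(sd, 2))
```

As with the grouped mean, these are approximations; exact values would require the 50 daily order counts.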

EXERCISES

CONCEPTS AND PROCEDURES

3.61 Are the values of the mean and standard deviation that are calculated using grouped data exact or
approximate values of the mean and standard deviation, respectively? Explain.

3.62 Using the population formulas, calculate the mean, variance, and standard deviation for the follow-
ing grouped data.

2-4

5-7

8-10

11-13

14-16

4 7

3.63 Using the sample formulas, find the mean, variance, and standard deviation for the grouped data dis-
played in the following table.

x f

0 to less than 4 17

4 to less than 8 23

8 to less than 12 15

12 to less than 16 11

16 to less than 20 8

20 to less than 24 6

■ APPLICATIONS

3.64 The following table gives the frequency distribution of the amounts of telephone bills for October
2009 for a sample of 50 families.

Amount of Telephone Bill     Number of Families

40 to less than 70

70 to less than 100

100 to less than 130

130 to less than 160

160 to less than 190

Calculate the mean, variance, and standard deviation.

3.65 The following table gives the frequency distribution of the number of hours spent per week playing
video games by all 60 students of the eighth grade at a school.

Hours per Week

Number of Students

0 to less than 5

5 to less than 10

10 to less than 15

15 to less than 20

20 to less than 25

25 to less than 30

Find the mean, variance, and standard deviation.

3.66 The following table gives the grouped data on the weights of all 100 babies born at a hospital in
2009.

Weight (pounds) Number of Babies

3 to less than 5 5

5 to less than 7 30

7 to less than 9 40

9 to less than 11 20

11 to less than 13 5

Find the mean, variance, and standard deviation.

3.67 The following table gives the frequency distribution of the total miles driven during 2009 by 300 car
owners.

Miles Driven in 2009

(in thousands) Number of Car Owners

0 to less than 5

5 to less than 10

10 to less than 15

15 to less than 20

20 to less than 25

25 to less than 30

30 to less than 35

35 to less than 40

Find the mean, variance, and standard deviation. Give a brief interpretation of the values in the column
labeled mf in your table of calculations. What does Σmf represent?

3.68 The following table gives information on the amounts (in dollars) of electric bills for August 2009
for a sample of 50 families.

Amount of Electric Bill (dollars)     Number of Families

0 to less than 40

40 to less than 80

80 to less than 120

120 to less than 160

160 to less than 200

Find the mean, variance, and standard deviation. Give a brief interpretation of the values in the column
labeled mf in your table of calculations. What does Σmf represent?

3.69 For 50 airplanes that arrived late at an airport during a week, the time by which they were late was
observed. In the following table, x denotes the time (in minutes) by which an airplane was late, and f
denotes the number of airplanes.

0 to less than 20

20 to less than 40

40 to less than 60

60 to less than 80

80 to less than 100

Find the mean, variance, and standard deviation.

3.70 The following table gives the frequency distribution of the number of errors committed by a college
baseball team in all of the 45 games that it played during the 2008-09 season.

Number of Errors

Number of Games

Find the mean, variance, and standard deviation. (Hint: The classes in this example are single valued. These
values of classes will be used as values of m in the formulas for the mean, variance, and standard deviation.)

3.71 Spot prices per barrel of crude oil reached their highest levels in history during June and July of
2008. The following data give the spot prices (in dollars) of a barrel of crude oil for 14 business days from
June 30, 2008, through July 18, 2008 (Energy Information Administration, April 15, 2009).

139.96 141.06 143.74 145.31 141.38 136.06 135.88
141.47 144.96 145.16 138.68 134.63 129.43 128.94

a. Find the mean for these data.

b. Construct a frequency distribution table for these data using a class width of 3.00 and the lower
boundary of the first class equal to 128.00.

c. Using the method of Section 3.3.1, find the mean of the grouped data of part b.

d. Compare your means from parts a and c. If the two means are not equal, explain why they differ.

3.4 Use of Standard Deviation

By using the mean and standard deviation, we can find the proportion or percentage of the to-
tal observations that fall within a given interval about the mean. This section briefly discusses
Chebyshev's theorem and the empirical rule, both of which demonstrate this use of the stan-
dard deviation.


3.4.1 Chebyshev's Theorem

Chebyshev's theorem gives a lower bound for the area under a curve between two points that
are on opposite sides of the mean and at the same distance from the mean.

Definition

Chebyshev's Theorem For any number k greater than 1, at least (1 − 1/k²) of the data values lie
within k standard deviations of the mean.

Figure 3.5 illustrates Chebyshev's theorem.

Thus, for example, if k = 2, then

\[
1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 1 - \frac{1}{4} = 1 - .25 = .75 \text{ or } 75\%
\]
Therefore, according to Chebyshev's theorem, at least .75, or 75%, of the values of a data set
lie within two standard deviations of the mean. This is shown in Figure 3.6.

Figure 3.6 Percentage of values within two standard deviations of the mean for Chebyshev's
theorem. (At least 75% of the values lie in the shaded areas between μ − 2σ and μ + 2σ.)

If k = 3, then

\[
1 - \frac{1}{k^2} = 1 - \frac{1}{(3)^2} = 1 - \frac{1}{9} = 1 - .11 = .89 \text{ or } 89\% \text{ approximately}
\]

According to Chebyshev's theorem, at least .89, or 89%, of the values fall within three standard
deviations of the mean. This is shown in Figure 3.7.

Figure 3.7 Percentage of values within three standard deviations of the mean for Chebyshev's
theorem. (At least 89% of the values lie in the shaded areas.)

Although in Figures 3.5 through 3.7 we have used the population notation for the mean and
standard deviation, the theorem applies to both sample and population data. Note that Chebyshev's
theorem is applicable to a distribution of any shape. However, Chebyshev's theorem can be used

only for k > 1. This is so because when k = 1, the value of 1 − 1/k² is zero, and when k < 1,
the value of 1 − 1/k² is negative.
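
The bound is simple to tabulate. The function name chebyshev_lower_bound below is our own; it is a sketch that rejects k ≤ 1, where the theorem gives no useful information.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is used only for k > 1")
    return 1 - 1 / k ** 2

# Bounds for a few values of k; k need not be a whole number.
for k in (2, 2.5, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

These are lower bounds that hold for a distribution of any shape, so the actual proportion within k standard deviations may be considerably higher.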

■ EXAMPLE 3-18

The average systolic blood pressure for 4000 women who were screened for high blood pres-
sure was found to be 187 mm Hg with a standard deviation of 22. Using Chebyshev's theo-
rem, find at least what percentage of women in this group have a systolic blood pressure be-
tween 143 and 231 mm Hg.

Solution Let μ and σ be the mean and the standard deviation, respectively, of the systolic
blood pressures of these women. Then, from the given information,

\[
\mu = 187 \quad \text{and} \quad \sigma = 22
\]

To find the percentage of women whose systolic blood pressures are between 143 and
231 mm Hg, the first step is to determine k. As shown below, each of the two points, 143 and
231, is 44 units away from the mean.

Applying Chebyshev's theorem.

\[
143 - 187 = -44 \qquad \text{and} \qquad 231 - 187 = 44
\]

The value of k is obtained by dividing the distance between the mean and each point by the
standard deviation. Thus,

\[
k = \frac{44}{22} = 2
\]

\[
1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 1 - \frac{1}{4} = 1 - .25 = .75 \text{ or } 75\%
\]

Figure 3.8 Percentage of women with systolic blood pressure
between 143 and 231. (At least 75% of the women have systolic blood pressure between
143 and 231, that is, between μ − 2σ and μ + 2σ.)

Hence, according to Chebyshev's theorem, at least 75% of the women have systolic blood
pressure between 143 and 231 mm Hg. This percentage is shown in Figure 3.8. ■
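
The reasoning of Example 3-18 can be written out as a short calculation. The variable names are our own, and the final line, which converts the proportion into a head count for the 4000 women, is an added illustration rather than part of the example.

```python
# Given information from Example 3-18.
mean, sd = 187, 22
lower, upper = 143, 231
group_size = 4000

# The interval must be symmetric about the mean for the theorem to apply directly.
dist_below = mean - lower
dist_above = upper - mean
assert dist_below == dist_above

k = dist_above / sd             # number of standard deviations
min_share = 1 - 1 / k ** 2      # Chebyshev lower bound
min_count = min_share * group_size

print(k, min_share, min_count)
```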

3.4.2 Empirical Rule

Whereas Chebyshev's theorem is applicable to any kind of distribution, the empirical rule
applies only to a specific type of distribution called a bell-shaped distribution, as shown in
Figure 3.9. More will be said about such a distribution in Chapter 6, where it is called a nor-
mal curve. In this section, only the following three rules for the curve are given.

Empirical Rule For a bell-shaped distribution, approximately

1. 68% of the observations lie within one standard deviation of the mean.

2. 95% of the observations lie within two standard deviations of the mean.

3. 99.7% of the observations lie within three standard deviations of the mean.

Figure 3.9 illustrates the empirical rule. Again, the empirical rule applies to both popula-
tion data and sample data.
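
The three percentages of the empirical rule can be compared with the areas under a normal curve. The sketch below assumes that "bell-shaped" means exactly normal, which is an idealization, and uses NormalDist from Python's standard statistics module (Python 3.8 or later). The name within_k is our own.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Proportion of the area within k standard deviations of the mean.
within_k = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within_k.items():
    print(k, round(100 * share, 2))
```

The results round to about 68%, 95%, and 99.7%, which is where the figures in the rule come from. For k = 2 and k = 3 they are well above the Chebyshev lower bounds of 75% and 89%.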

HERE COMES THE SD

Source: Daniel Seligman, "Here comes
the SD," Fortune, May 15, 1995.
Copyright © 1995, The Time Inc.
Reproduced with permission. All rights
reserved.

When your servant first became a Fortune writer several decades ago, it was hard doctrine that "several"
meant three to eight, also that writers must not refer to "gross national product" without pausing to define
this arcane term. GNP was in fact a relatively new concept at the time, having been introduced to the coun-
try only several years previously— in Roosevelt's 1944 budget message— so the presumption that readers had
to be told repeatedly it was the "value of all goods and services produced by the economy" seemed entirely
reasonable to this young writer, who personally had to look up the definition every time.

Numeracy lurches on. Nowadays the big question for editors is whether an average college-educated
bloke needs a handhold when confronted with the term "standard deviation." The SD is suddenly on-
stage because the Securities and Exchange Commission is wondering aloud whether investment com-
panies should be required to tell investors the standard deviation of their mutual funds' total returns
over various past periods. Barry Barbash, SEC director of investment management, favors the require-
ment but confessed to the Washington Post that he worries about investors who will think a standard
deviation is the dividing line on a highway or something.

The view around here is that the SEC is performing a noble service, but only partly because the re-
quirement would enhance folks' insights into mutual funds. The commission's underlying idea is to give
investors a better and more objective measure than is now available of the risk associated with differ-
ent kinds of portfolios. The SD is a measure of variability, and funds with unusually variable returns—
sometimes very high, sometimes very low— are presumed to be more risky.

What one really likes about the proposal, however, is the prospect that it will incentivize millions of
greedy Americans to learn a little elementary statistics. One already has a list of issues that could be dis-
cussed much more thrillingly if only your average liberal arts graduate had a glimmer about the SD and
the normal curve. The bell-shaped normal curve, or rather, the area underneath the curve, shows you
how Providence arranged for things to be distributed in our world— with people's heights, or incomes, or
IQs, or investment returns bunched around middling outcomes, and fewer and fewer cases as you move
down and out toward the extremes. A line down the center of the curve represents the mean outcome,
and deviations from the mean are measured by the SD.

An amazing property of the SD is that exactly 68.26% of all normally distributed data are within one SD
of the mean. We once asked a professor of statistics a question that seemed to us quite profound, to wit,
why that particular figure? The Prof answered dismissively that God had decided on 68.26% for exactly the
same reason He had landed on 3.14 as the ratio between circumferences and diameters— because He just
felt like it. The Almighty has also proclaimed that 95.44% of all data are within two SDs of the mean, and
99.73% within three SDs. When you know the mean and SD of some outcome, you can instantly establish
the percentage probability of its occurrence. White men's heights in the U.S. average 69.2 inches, with an SD
of 2.8 inches (according to the National Center for Health Statistics), which means that a 6-foot-5 chap is in
the 99th percentile. In 1994, scores on the verbal portion of the Scholastic Assessment Test had a mean of
423 and an SD of 113, so if you scored 649— two SDs above the mean— you were in the 95th percentile.

As the SEC is heavily hinting, average outcomes are interesting but for many purposes inadequate; one
also yearns to know the variability around that average. From 1926 through 1994, the S&P 500 had an av-
erage annual return of just about 10%. The SD accompanying that figure was just about 20%. Since returns
will be within 1 SD some 68% of the time, they will be more than 1 SD from the mean 32% of the time.
And since half these swings will be on the downside, we expect fund owners to lose more than 10% of
their money about one year out of six and to lose more than 30% (two SDs below the mean) about one
year out of 20. If your time horizon is short and you can't take losses like that, you arguably don't belong in
stocks. If you think SDs are highway dividers, you arguably don't belong in cars.

Figure 3.9 Illustration of the empirical rule. (Horizontal axis marked at μ − 3σ, μ − 2σ,
μ − σ, μ, μ + σ, μ + 2σ, and μ + 3σ.)

■ EXAMPLE 3-19

The age distribution of a sample of 5000 persons is bell shaped with a mean of 40 years and
a standard deviation of 12 years. Determine the approximate percentage of people who are 16
to 64 years old.

Applying the empirical rule.

Solution We use the empirical rule to find the required percentage because the distribution
of ages follows a bell-shaped curve. From the given information, for this distribution,

x = 40 years and s = 12 years

16 x = 40 64 Ages

t t

x - 2s x + 2s

Figure 3.10 Percentage of people who are 16 to 64 years old.

Each of the two points, 16 and 64, is 24 units away from the mean. Dividing 24 by 12, we
express the distance between each of the two points and the mean in terms of standard deviations.
Thus, the distance between 16 and 40 and that between 40 and 64 are each equal to 2s.
Consequently, as shown in Figure 3.10, the area from 16 to 64 is the area from x̄ − 2s to x̄ + 2s.

Because the area within two standard deviations of the mean is approximately 95% for a
bell-shaped curve, approximately 95% of the people in the sample are 16 to 64 years old. ■
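
The reasoning of Example 3-19 can be sketched in a few lines of Python. This is only an illustration: the function name and the lookup table are assumptions introduced here, and the sketch handles only intervals that are symmetric about the mean and end exactly 1, 2, or 3 standard deviations away from it.

```python
import math

# Approximate percentage of a bell-shaped distribution that lies
# within k standard deviations of the mean (the empirical rule).
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_rule_percentage(mean, sd, low, high):
    """Return the approximate percentage of values between low and high.

    Hypothetical helper: it only accepts intervals of the form
    mean - k*sd to mean + k*sd with k equal to 1, 2, or 3.
    """
    k_low = (mean - low) / sd    # distance below the mean, in SDs
    k_high = (high - mean) / sd  # distance above the mean, in SDs
    k = round(k_low)
    symmetric = math.isclose(k_low, k_high)
    whole = math.isclose(k_low, k)
    if not (symmetric and whole and k in EMPIRICAL_RULE):
        raise ValueError("interval is not mean ± k·sd for k = 1, 2, or 3")
    return EMPIRICAL_RULE[k]


# Example 3-19: mean 40 years, standard deviation 12 years, ages 16 to 64.
example_3_19 = empirical_rule_percentage(mean=40, sd=12, low=16, high=64)
print(example_3_19)  # 95.0
```

Dividing each distance by the standard deviation first, as the solution does, is what turns the question into a table lookup.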

EXERCISES

CONCEPTS AND PROCEDURES

3.72 Briefly explain Chebyshev's theorem and its applications.

3.73 Briefly explain the empirical rule. To what kind of distribution is it applied?

3.74 A sample of 2000 observations has a mean of 74 and a standard deviation of 12. Using Chebyshev's
theorem, find at least what percentage of the observations fall in the intervals x̄ ± 2s, x̄ ± 2.5s, and
x̄ ± 3s. Note that here x̄ ± 2s represents the interval x̄ − 2s to x̄ + 2s, and so on.

3.75 A large population has a mean of 230 and a standard deviation of 41. Using Chebyshev's theorem,
find at least what percentage of the observations fall in the intervals μ ± 2σ, μ ± 2.5σ, and μ ± 3σ.

3.76 A large population has a mean of 310 and a standard deviation of 37. Using the empirical rule, find
what percentage of the observations fall in the intervals μ ± 1σ, μ ± 2σ, and μ ± 3σ.

3.77 A sample of 3000 observations has a mean of 82 and a standard deviation of 16. Using the empirical
rule, find what percentage of the observations fall in the intervals x̄ ± 1s, x̄ ± 2s, and x̄ ± 3s.

■ APPLICATIONS

3.78 The mean time taken by all participants to run a road race was found to be 220 minutes with a stan-
dard deviation of 20 minutes. Using Chebyshev's theorem, find the percentage of runners who ran this
road race in

a. 180 to 260 minutes b. 160 to 280 minutes c. 170 to 270 minutes

3.79 The 2009 gross sales of all companies in a large city have a mean of $2.3 million and a standard de-
viation of $.6 million. Using Chebyshev's theorem, find at least what percentage of companies in this city
had 2009 gross sales of

a. $1.1 to $3.5 million b. $.8 to $3.8 million c. $.5 to $4.1 million


3.80 Suppose the average credit card debt for households currently is $9500 with a standard deviation of $2600.
a. Using Chebyshev's theorem, find at least what percentage of current credit card debts for all house-
holds are between

i. $4300 and $14,700 ii. $3000 and $16,000
*b. Using Chebyshev's theorem, find the interval that contains credit card debts of at least 89% of all
households.

3.81 The mean monthly mortgage paid by all home owners in a town is $2365 with a standard deviation
of $340.

a. Using Chebyshev's theorem, find at least what percentage of all home owners in this town pay a
monthly mortgage of

i. $1685 to $3045 ii. $1345 to $3385

*b. Using Chebyshev's theorem, find the interval that contains the monthly mortgage payments of at
least 84% of all home owners.

3.82 The mean life of a certain brand of auto batteries is 44 months with a standard deviation of 3 months.
Assume that the lives of all auto batteries of this brand have a bell-shaped distribution. Using the empir-
ical rule, find the percentage of auto batteries of this brand that have a life of

a. 41 to 47 months b. 38 to 50 months c. 35 to 53 months

3.83 According to an article in the Washington Post (Washington Post, January 5, 2009), the average em-
ployee share of health insurance premiums at large U.S. companies is expected to be $3423 in 2009. Sup-
pose that the current annual payments by all such employees toward health insurance premiums have a
bell-shaped distribution with a mean of $3423 and a standard deviation of $520. Using the empirical rule,
find the approximate percentage of employees whose annual payments toward such premiums are between

a. $1863 and $4983 b. $2903 and $3943 c. $2383 and $4463

3.84 The prices of all college textbooks follow a bell-shaped distribution with a mean of $105 and a stan-
dard deviation of $20.

a. Using the empirical rule, find the percentage of all college textbooks with their prices between
i. $85 and $125 ii. $65 and $145

*b. Using the empirical rule, find the interval that contains the prices of 99.7% of college textbooks.

3.85 Suppose that on a certain section of I-95 with a posted speed limit of 65 mph, the speeds of all
vehicles have a bell-shaped distribution with a mean of 72 mph and a standard deviation of 3 mph.

a. Using the empirical rule, find the percentage of vehicles with the following speeds on this
section of I-95.

i. 63 to 81 mph ii. 69 to 75 mph

*b. Using the empirical rule, find the interval that contains the speeds of 95% of vehicles traveling
on this section of I-95.

3.5 Measures of Position

A measure of position determines the position of a single value in relation to other values in
a sample or a population data set. There are many measures of position; however, only quar-
tiles, percentiles, and percentile rank are discussed in this section.

3.5.1 Quartiles and Interquartile Range

Quartiles are the summary measures that divide a ranked data set into four equal parts. Three
measures will divide any data set into four equal parts. These three measures are the first
quartile (denoted by Q1), the second quartile (denoted by Q2), and the third quartile (denoted by
Q3). The data should be ranked in increasing order before the quartiles are determined. The
quartiles are defined as follows.

Definition

Quartiles Quartiles are three summary measures that divide a ranked data set into four equal
parts. The second quartile is the same as the median of a data set. The first quartile is the value
of the middle term among the observations that are less than the median, and the third quartile
is the value of the middle term among the observations that are greater than the median.

Figure 3.11 describes the positions of the three quartiles.


[Figure: a ranked data set divided by Q1, Q2, and Q3 into four portions; each of these portions contains 25% of the observations of a data set arranged in increasing order.]

Figure 3.11 Quartiles.

Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are
greater than Q1. The second quartile, Q2, divides a ranked data set into two equal parts; hence,
the second quartile and the median are the same. Approximately 75% of the data values are less
than Q3 and about 25% are greater than Q3.

The difference between the third quartile and the first quartile for a data set is called the
interquartile range (IQR).

Calculating Interquartile Range The difference between the third and the first quartiles gives the
interquartile range; that is,

IQR = Interquartile range = Q3 − Q1

Examples 3-20 and 3-21 show the calculation of the quartiles and the interquartile range.

■ EXAMPLE 3-20

Refer to Table 3.3 in Example 3-5, which gives the 2008 profits (rounded to billions of dol-
lars) of 12 companies selected from all over the world. That table is reproduced below.

Finding quartiles and the
interquartile range.

Company

2008 Profits
(billions of dollars)

Merck & Co

IBM

Unilever

Microsoft

Petrobras

Exxon Mobil

Lukoil

AT&T

Nestle

Vodafone

Deutsche Bank

China Mobile

(a) Find the values of the three quartiles. Where does the 2008 profit of Merck & Co
fall in relation to these quartiles?

(b) Find the interquartile range.

Solution

(a) First we rank the given data in increasing order. Then we calculate the three quartiles as
follows:

Finding quartiles for an even number of data values.

Values less than the median: 7 8 9 10 11 12
Values greater than the median: 13 13 14 17 17 45

Q1 = (9 + 10)/2 = 9.5     Q2 = (12 + 13)/2 = 12.5 (also the median)     Q3 = (14 + 17)/2 = 15.5


Finding the interquartile range.

The value of Q2, which is also the median, is given by the value of the middle term
in the ranked data set. For the data of this example, this value is the average of the
sixth and seventh terms. Consequently, Q2 is $12.5 billion. The value of Q1 is given
by the value of the middle term of the six values that fall below the median (or Q2).
Thus, it is obtained by taking the average of the third and fourth terms. So, Q1 is $9.5
billion. The value of Q3 is given by the value of the middle term of the six values that
fall above the median. For the data of this example, Q3 is obtained by taking the
average of the ninth and tenth terms, and it is $15.5 billion.

The value of Q1 = $9.5 billion indicates that 25% of the companies in this sample
had 2008 profits less than $9.5 billion and 75% of the companies had 2008 profits
higher than $9.5 billion. Similarly, we can state that half of these companies had 2008
profits less than $12.5 billion and the other half had profits greater than $12.5 billion
since the second quartile is $12.5 billion. The value of Q3 = $15.5 billion indicates
that 75% of the companies had 2008 profits less than $15.5 billion and 25% had
profits greater than this value.

By looking at the position of $8 billion, which is the 2008 profit of Merck & Co,
we can state that this value lies in the bottom 25% of the profits for 2008.

(b) The interquartile range is given by the difference between the values of the third and
the first quartiles. Thus,

IQR = Interquartile range = Q3 − Q1 = 15.5 − 9.5 = $6 billion ■
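
The procedure of Example 3-20 can be sketched in Python as follows. The function names are assumptions introduced for illustration. The sketch splits the ranked data by position (the lower half and the upper half, leaving out the middle term when the number of values is odd), which reproduces the worked example; how values tied with the median should be treated is an assumption. Statistical software often uses other quartile conventions and may return slightly different values.

```python
def median(ranked):
    """Middle term of an already ranked list (average of the two middle terms if even)."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    lower = ranked[: n // 2]        # terms below the median position
    upper = ranked[(n + 1) // 2:]   # terms above the median position
    return median(lower), median(ranked), median(upper)


# 2008 profits (billions of dollars) of the 12 companies in Example 3-20.
profits = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
q1, q2, q3 = quartiles(profits)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9.5 12.5 15.5 6.0
```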

Finding quartiles and the
interquartile range.

■ EXAMPLE 3-21

The following are the ages (in years) of nine employees of an insurance company:
47 28 39 51 33 37 59 24 33

(a) Find the values of the three quartiles. Where does the age of 28 years fall in relation
to the ages of these employees?

(b) Find the interquartile range.

Finding quartiles for an odd number of data values.

Solution

(a) First we rank the given data in increasing order. Then we calculate the three quartiles
as follows:

Values less than the median: 24 28 33 33
Values greater than the median: 39 47 51 59

Q1 = (28 + 33)/2 = 30.5     Q2 = 37 (also the median)     Q3 = (47 + 51)/2 = 49

Finding the interquartile range.

Thus the values of the three quartiles are

Q1 = 30.5 years, Q2 = 37 years, and Q3 = 49 years

The age of 28 falls in the lowest 25% of the ages.

(b) The interquartile range is

IQR = Interquartile range = Q3 − Q1 = 49 − 30.5 = 18.5 years


3.5.2 Percentiles and Percentile Rank

Percentiles are the summary measures that divide a ranked data set into 100 equal parts. Each
(ranked) data set has 99 percentiles that divide it into 100 equal parts. The data should be ranked
in increasing order to compute percentiles. The kth percentile is denoted by Pk, where k is an
integer in the range 1 to 99. For instance, the 25th percentile is denoted by P25. Figure 3.12
shows the positions of the 99 percentiles.

[Figure: a ranked data set divided by P1, P2, P3, ..., P97, P98, P99 into 100 portions; each of these portions contains 1% of the observations of a data set arranged in increasing order.]

Figure 3.12 Percentiles.

Thus, the kth percentile, Pk, can be defined as a value in a data set such that about k% of
the measurements are smaller than the value of Pk and about (100 − k)% of the measurements
are greater than the value of Pk.

The approximate value of the kth percentile is determined as explained next.

Calculating Percentiles The (approximate) value of the kth percentile, denoted by Pk, is

Pk = Value of the (kn/100)th term in a ranked data set

where k denotes the number of the percentile and n represents the sample size.

Example 3-22 describes the procedure to calculate the percentiles.

■ EXAMPLE 3-22

Finding the percentile for a data set.

Refer to the data on 2008 profits for 12 companies given in Example 3-20. Find the value of
the 42nd percentile. Give a brief interpretation of the 42nd percentile.

Solution From Example 3-20, the data arranged in increasing order are as follows:

7 8 9 10 11 12 13 13 14 17 17 45

The position of the 42nd percentile is

kn/100 = 42(12)/100 = 5.04th term

The value of the 5.04th term can be approximated by the value of the 5th term in the ranked
data. Therefore,

P42 = 42nd percentile = 11 = $11 billion

Thus, approximately 42% of these 12 companies had 2008 profits less than or equal to
$11 billion. ■
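
The calculation in Example 3-22 can be sketched in Python. The function names are assumptions introduced for illustration. The text only says that the 5.04th term is approximated by the 5th term, so rounding the position to the nearest whole term is an assumption made here; other rounding or interpolation rules would give different answers for some data sets.

```python
def percentile_position(k, n):
    """Position kn/100 of the kth percentile in a ranked data set of n values."""
    return k * n / 100


def approximate_percentile(values, k):
    """Approximate the kth percentile by the term nearest to position kn/100."""
    ranked = sorted(values)
    position = percentile_position(k, len(ranked))
    # Assumption: round to the nearest whole term and keep it within 1..n.
    term = min(max(round(position), 1), len(ranked))
    return ranked[term - 1]  # terms are counted from 1, list indexes from 0


# 2008 profits (billions of dollars) from Example 3-20.
profits = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
p42 = approximate_percentile(profits, 42)
print(p42)  # 11
```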

We can also calculate the percentile rank for a particular value x_i of a data set by using
the formula given below. The percentile rank of x_i gives the percentage of values in the data set
that are less than x_i.

Finding Percentile Rank of a Value

Percentile rank of x_i = (Number of values less than x_i / Total number of values in the data set) × 100

Example 3-23 shows how the percentile rank is calculated for a data value.


■ EXAMPLE 3-23

Finding the percentile rank for a data value.

Refer to the data on 2008 profits for 12 companies given in Example 3-20. Find the percentile
rank for the $14 billion profit of Petrobras. Give a brief interpretation of this percentile rank.

Solution From Example 3-20, the data arranged in increasing order are as follows:

7 8 9 10 11 12 13 13 14 17 17 45

In this data set, 8 of the 12 values are less than $14 billion. Hence,

Percentile rank of 14 = (8/12) × 100 = 66.67%

Rounding this answer to the nearest integral value, we can state that about 67% of these
12 companies had profits of less than $14 billion in 2008. Hence, about 33% of these 12
companies had profits of $14 billion or higher in 2008. ■
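
The percentile rank formula applied in Example 3-23 can be sketched in Python. The function name is an assumption introduced for illustration; the count uses values strictly less than the given value, as the formula states.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


# 2008 profits (billions of dollars) from Example 3-20.
profits = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
rank_of_14 = percentile_rank(profits, 14)
print(round(rank_of_14, 2))  # 66.67
```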

EXERCISES

CONCEPTS AND PROCEDURES

3.86 Briefly describe how the three quartiles are calculated for a data set. Illustrate by calculating the three
quartiles for two examples, the first with an odd number of observations and the second with an even num-
ber of observations.

3.87 Explain how the interquartile range is calculated. Give one example.

3.88 Briefly describe how the percentiles are calculated for a data set.

3.89 Explain the concept of the percentile rank for an observation of a data set.

■ APPLICATIONS

3.90 The following data give the weights (in pounds) lost by 15 members of a health club at the end of
2 months after joining the club.

5 10 8 7 25 12 5 14

11 10 21 9 8 11 18

a. Compute the values of the three quartiles and the interquartile range.

b. Calculate the (approximate) value of the 82nd percentile.

c. Find the percentile rank of 10.

3.91 The following data give the speeds of 13 cars (in mph) measured by radar, traveling on I-84.

73 75 69 68 78 69 74
76 72 79 68 77 71

a. Find the values of the three quartiles and the interquartile range.

b. Calculate the (approximate) value of the 35th percentile.

c. Compute the percentile rank of 71.

3.92 The following data give the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days.

a. Calculate the values of the three quartiles and the interquartile range.

b. Determine the (approximate) value of the 53rd percentile.

c. Find the percentile rank of 50.

3.93 The following data give the numbers of minor penalties accrued by each of the 30 National Hockey
League franchises during the 2007-08 regular season.

318 336 337 339 362 363 366 369 372 375
378 381 384 385 386 387 390 393 395 403
405 409 417 431 433 434 438 444 461 480


a. Calculate the values of the three quartiles and the interquartile range.

b. Find the approximate value of the 57th percentile.

c. Calculate the percentile rank of 417.

3.94 Refer to Exercise 3.22. The following data give the number of students suspended for bringing
weapons to schools in the Tri-City School District for each of the past 12 weeks.

15 9 12 11 7 6 9 10 14 3 6 5

a. Determine the values of the three quartiles and the interquartile range. Where does the value of
10 fall in relation to these quartiles?

b. Calculate the (approximate) value of the 55th percentile.

c. Find the percentile rank of 7.

3.95 Nixon Corporation manufactures computer monitors. The following data give the numbers of com-
puter monitors produced at the company for a sample of 30 days.

a. Calculate the values of the three quartiles and the interquartile range. Where does the value of
31 lie in relation to these quartiles?

b. Find the (approximate) value of the 65th percentile. Give a brief interpretation of this percentile.

c. For what percentage of the days was the number of computer monitors produced 32 or higher?
Answer by finding the percentile rank of 32.

3.96 The following data give the numbers of new cars sold at a dealership during a 20-day period.

8 5 12 3 9 10 6 12 8 8

4 16 10 11 7 7 3 5 9 11

a. Calculate the values of the three quartiles and the interquartile range. Where does the value of 4
lie in relation to these quartiles?

b. Find the (approximate) value of the 25th percentile. Give a brief interpretation of this percentile.

c. Find the percentile rank of 10. Give a brief interpretation of this percentile rank.

3.97 According to Fair Isaac, "The Median FICO (Credit) Score in the U.S. is 723" (The Credit Scoring
Site, 2009). Suppose the following data represent the credit scores of 22 randomly selected loan applicants.

494 728 468 533 747 639 430 690 604 422 356
805 749 600 797 702 628 625 617 647 772 572

a. Calculate the values of the three quartiles and the interquartile range. Where does the value 617
fall in relation to these quartiles?

b. Find the approximate value of the 30th percentile. Give a brief interpretation of this percentile.

c. Calculate the percentile rank of 533. Give a brief interpretation of this percentile rank.

3.6 Box-and-Whisker Plot

A box-and-whisker plot gives a graphic presentation of data using five measures: the median,
the first quartile, the third quartile, and the smallest and the largest values in the data set be-
tween the lower and the upper inner fences. (The inner fences are explained in Example 3-24
below.) A box-and-whisker plot can help us visualize the center, the spread, and the skewness
of a data set. It also helps detect outliers. We can compare different distributions by making
box-and-whisker plots for each of them.

Definition

Box-and-Whisker Plot A plot that shows the center, spread, and skewness of a data set. It is con-
structed by drawing a box and two whiskers that use the median, the first quartile, the third quar-
tile, and the smallest and the largest values in the data set between the lower and the upper inner
fences.

Example 3-24 explains all the steps needed to make a box-and-whisker plot.


Constructing a
box-and-whisker plot.

■ EXAMPLE 3-24

The following data are the incomes (in thousands of dollars) for a sample of 12 households.

75 69 84 112 74 104 81 90 94 144 79 98

Construct a box-and-whisker plot for these data.

Solution The following five steps are performed to construct a box-and-whisker plot.

Step 1. First, rank the data in increasing order and calculate the values of the median, the
first quartile, the third quartile, and the interquartile range. The ranked data are

69 74 75 79 81 84 90 94 98 104 112 144

For these data,

Median = (84 + 90)/2 = 87
Q1 = (75 + 79)/2 = 77
Q3 = (98 + 104)/2 = 101
IQR = Q3 − Q1 = 101 − 77 = 24

Step 2. Find the points that are 1.5 × IQR below Q1 and 1.5 × IQR above Q3. These
two points are called the lower and the upper inner fences, respectively.

1.5 × IQR = 1.5 × 24 = 36
Lower inner fence = Q1 − 36 = 77 − 36 = 41
Upper inner fence = Q3 + 36 = 101 + 36 = 137

Step 3. Determine the smallest and the largest values in the given data set within the two
inner fences. These two values for our example are as follows:

Smallest value within the two inner fences = 69
Largest value within the two inner fences = 112

Step 4. Draw a horizontal line and mark the income levels on it such that all the values
in the given data set are covered. Above the horizontal line, draw a box with its left side
at the position of the first quartile and the right side at the position of the third quartile. In-
side the box, draw a vertical line at the position of the median. The result of this step is
shown in Figure 3.13.

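[Figure: a box drawn above the Income axis (scale from 65 to 145), with its left side at the first quartile, its right side at the third quartile, and a vertical line inside the box at the median.]

Figure 3.13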

Step 5. By drawing two lines, join the points of the smallest and the largest values within
the two inner fences to the box. These values are 69 and 112 in this example as listed in
Step 3. The two lines that join the box to these two values are called whiskers. A value
that falls outside the two inner fences is shown by marking an asterisk and is called an
outlier. This completes the box-and-whisker plot, as shown in Figure 3.14.

[Figure: the completed box-and-whisker plot on the Income axis, labeling the first quartile, the median, the third quartile, the smallest value within the two inner fences, the largest value within the two inner fences, and an outlier marked with an asterisk.]

Figure 3.14

In Figure 3.14, about 50% of the data values fall within the box, about 25% of the
values fall on the left side of the box, and about 25% fall on the right side of the box. Also, 50%
of the values fall on the left side of the median and 50% lie on the right side of the median.
The data of this example are skewed to the right because the lower 50% of the values are
spread over a smaller range than the upper 50% of the values. ■

The observations that fall outside the two inner fences are called outliers. Outliers
can be classified into two kinds: mild and extreme. To do so, we define two
outer fences: a lower outer fence at 3.0 × IQR below the first quartile and an upper outer
fence at 3.0 × IQR above the third quartile. If an observation is outside either of the two inner
fences but within the two outer fences, it is called a mild outlier. An observation that
is outside either of the two outer fences is called an extreme outlier. For the previous example,
3.0 × IQR = 3.0 × 24 = 72, so the outer fences are at 77 − 72 = 5 and 101 + 72 = 173. Because
144 is outside the upper inner fence but inside the upper outer fence, it is a mild outlier.
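
The numerical part of Example 3-24 (quartiles, fences, whisker ends, and outliers) can be sketched in Python. The function and key names are assumptions introduced for illustration, and the quartiles are found by splitting the ranked data by position. Treating a value that falls exactly on a fence as lying within that fence is also an assumption, since the text does not address that case.

```python
def median(ranked):
    """Middle term of an already ranked list (average of the two middle terms if even)."""
    n = len(ranked)
    mid = n // 2
    return ranked[mid] if n % 2 == 1 else (ranked[mid - 1] + ranked[mid]) / 2


def box_plot_summary(values):
    """Return the measures needed to draw a box-and-whisker plot."""
    ranked = sorted(values)
    n = len(ranked)
    q1 = median(ranked[: n // 2])
    q3 = median(ranked[(n + 1) // 2:])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # lower and upper inner fences
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # lower and upper outer fences
    inside = [v for v in ranked if inner[0] <= v <= inner[1]]
    outliers = [v for v in ranked if v < inner[0] or v > inner[1]]
    return {
        "q1": q1,
        "median": median(ranked),
        "q3": q3,
        "iqr": iqr,
        "inner_fences": inner,
        "outer_fences": outer,
        "whiskers": (inside[0], inside[-1]),  # smallest and largest values within the inner fences
        "mild_outliers": [v for v in outliers if outer[0] <= v <= outer[1]],
        "extreme_outliers": [v for v in outliers if v < outer[0] or v > outer[1]],
    }


# Incomes (thousands of dollars) of the 12 households in Example 3-24.
incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
summary = box_plot_summary(incomes)
print(summary["inner_fences"])   # (41.0, 137.0)
print(summary["outer_fences"])   # (5.0, 173.0)
print(summary["whiskers"])       # (69, 112)
print(summary["mild_outliers"])  # [144]
```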

For a symmetric data set, the line representing the median will be in the middle of the box
and the spread of the values will be over almost the same range on both sides of the box.

EXERCISES

CONCEPTS AND PROCEDURES

3.98 Briefly explain what summary measures are used to construct a box-and-whisker plot.

3.99 Prepare a box-and-whisker plot for the following data:

Does this data set contain any outliers?

3.100 Prepare a box-and-whisker plot for the following data:

11 8 26 31 62 19 7

33 30 42 15 18 23 29

Does this data set contain any outliers?

14
16

75
6

■ APPLICATIONS

3.101 The following data give the time (in minutes) that each of 20 students selected from a university
waited in line at their bookstore to pay for their textbooks in the beginning of the Fall 2009 semester.

15 8 23 21 5 17 31 22 34 6

5 10 14 17 16 25 30 3 31 19

Prepare a box-and-whisker plot. Comment on the skewness of these data.

3.102 Refer to Exercise 3.97. The following data represent the credit scores of 22 randomly selected loan
applicants.

494 728 468 533 747 639 430 690 604 422 356
805 749 600 797 702 628 625 617 647 772 572

Prepare a box-and-whisker plot. Are these data skewed in any direction?


3.103 The following data give the recent estimates of crude oil reserves (in billions of barrels) of Saudi
Arabia, Iraq, Kuwait, Iran, United Arab Emirates, Venezuela, Russia, Libya, Nigeria, China, Mexico, and
the United States. The reserves for these countries are listed in that order.

261.7 112.0 97.7 94.4 80.3 64.0

51.2 29.8 27.0 26.8 25.0 22.5

Prepare a box-and-whisker plot. Are the data symmetric or skewed?

3.104 The following data give the numbers of computer keyboards assembled at the Twentieth Century
Electronics Company for a sample of 25 days.

Prepare a box-and-whisker plot. Comment on the skewness of these data.

3.105 Refer to Exercise 3.93. The following data represent the numbers of minor penalties accrued by
each of the 30 National Hockey League franchises during the 2007-08 regular season.

318 336 337 339 362 363 366 369 372 375
378 381 384 385 386 387 390 393 395 403
405 409 417 431 433 434 438 444 461 480

Prepare a box-and-whisker plot. Are these data skewed in any direction?

3.106 Refer to Exercise 3.22. The following data give the number of students suspended for bringing
weapons to schools in the Tri-City School District for each of the past 12 weeks.

15 9 12 11 7 6 9 10 14 3 6 5

Make a box-and-whisker plot. Comment on the skewness of these data.

3.107 Nixon Corporation manufactures computer monitors. The following are the numbers of computer
monitors produced at the company for a sample of 30 days:

Prepare a box-and-whisker plot. Comment on the skewness of these data.

3.108 The following data give the numbers of new cars sold at a dealership during a 20-day period.

8 5 12 3 9 10 6 12 8 8

4 16 10 11 7 7 3 5 9 11

Make a box-and-whisker plot. Comment on the skewness of these data.

USES AND MISUSES... UNEMPLOYMENT RATES

During his fiscal year 2010 budget address, New Jersey Governor Jon
Corzine made the following statement: "Our unemployment rate is
below the national average. And as of January, at least 18 states had
higher jobless rates than New Jersey— often, significantly higher" (The
2010 State of New Jersey Budget Address, March 10, 2009).

Governor Corzine is certainly not the only politician to reference
statistics during a campaign speech, budget address, or State of the State
address. As a statistically literate member of society, one should be able
to examine a statement and the data to determine its meaning.

According to the Bureau of Labor Statistics, the unemployment
rate in New Jersey was 7.3% in January 2009. The rate was ranked
28th out of the 50 states and the District of Columbia, which was
ranked 43rd with an unemployment rate of 9.2%. Hence, 22 states
had higher jobless rates than New Jersey (BLS, January, 2009).

Governor Corzine stated that the state unemployment rate is below
the national average. Based on what you have learned in this
chapter, you might try to obtain the national average by taking the
average of the 51 unemployment rates. However, the average of these
51 numbers was 7.198 as of January 2009, which is not greater than
7.3. One must wonder why the Governor's statement is seemingly false.

Each state's unemployment rate is calculated by finding the
percentage of the state's labor force that is currently unable to find
a job. States such as California, New York, and Texas have many
more people in their labor forces than states such as Wyoming,
North Dakota, and Rhode Island; however, unemployment rates are
on the same scale, primarily for purposes of comparison. Whenever
a simple (unweighted) average is calculated, each observation contributes
equally to the average, which means that Wyoming and California would
contribute equally to the national unemployment rate, when California
should have a much larger impact on the national rate.

When the national unemployment rate is calculated, it measures
the percentage of the country's labor force that is currently
unable to obtain a job. The states with larger labor forces will have a
greater impact on the national rate than states with smaller labor
forces. The national unemployment rate in January 2009 was 7.6%,
which is higher than New Jersey's 7.3% unemployment rate. Hence,
the statement made by Governor Corzine (or his speechwriters) is
correct in its intended meaning. However, the statistical terminology
used in the speech is inappropriate.
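
The difference between the two calculations can be written out as a formula. In this sketch the symbols are assumptions introduced for illustration, not notation from the text: r_i is the unemployment rate of state i, U_i is its number of unemployed persons, and L_i is the size of its labor force, with i running over the 50 states and the District of Columbia.

```latex
\[
\text{simple average} = \frac{1}{51}\sum_{i=1}^{51} r_i ,
\qquad
\text{national rate} = \frac{\sum_{i=1}^{51} U_i}{\sum_{i=1}^{51} L_i}
= \sum_{i=1}^{51} \frac{L_i}{\sum_{j=1}^{51} L_j}\, r_i ,
\qquad
r_i = \frac{U_i}{L_i}.
\]
```

Conceptually, the national rate is a weighted mean of the state rates with the labor forces as weights, which is why it can differ from the simple average of the 51 rates.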

The other piece of statistical terminology used by Governor
Corzine is associated with unemployment rates being "significantly
higher." The unemployment rates that were higher than 7.3% ranged
from 7.4% (Massachusetts) to 11.6% (Michigan). How much higher
than 7.3% must an unemployment rate be in order to be classified
as "significantly higher?" Would someone who is unemployed answer
this question differently than someone who has a job? In Chapter 9,
we will address the notion of statistical significance, which will
provide us with a method for answering this question.

One other question that is often raised when statistics are
used in a speech or analysis relates to whether the comparisons
made in the analysis are appropriate. For example, New Jersey has
a large number of pharmaceutical firms. Southern New Jersey has
a large agricultural sector. Many statisticians would argue that it
makes more sense to compare agricultural unemployment levels
across states, as well as pharmaceutical (or science and technology)
unemployment rates, instead of combining them. Choosing
a level of aggregation can have a dramatic impact on the
interpretation of data.

Glossary

Bimodal distribution A distribution that has two modes.

Box-and-whisker plot A plot that shows the center, spread, and
skewness of a data set with a box and two whiskers using the median,
the first quartile, the third quartile, and the smallest and the largest
values in the data set between the lower and the upper inner fences.

Chebyshev's theorem For any number k greater than 1, at least
(1 − 1/k²) of the values for any distribution lie within k standard
deviations of the mean.

Coefficient of variation A measure of relative variability that ex-
presses standard deviation as a percentage of the mean.

Empirical rule For a specific bell-shaped distribution, about 68%
of the observations fall in the interval (μ − σ) to (μ + σ), about
95% fall in the interval (μ − 2σ) to (μ + 2σ), and about 99.7% fall
in the interval (μ − 3σ) to (μ + 3σ).

First quartile The value in a ranked data set such that about 25%
of the measurements are smaller than this value and about 75% are
larger. It is the median of the values that are smaller than the me-
dian of the whole data set.

Geometric mean Calculated by taking the nth root of the product
of all values in a data set.

Interquartile range (IQR) The difference between the third and
the first quartiles.

Lower inner fence The value in a data set that is 1.5 × IQR
below the first quartile.

Lower outer fence The value in a data set that is 3.0 × IQR
below the first quartile.

Mean A measure of central tendency calculated by dividing the
sum of all values by the number of values in the data set.

Measures of central tendency Measures that describe the center
of a distribution. The mean, median, and mode are three of the meas-
ures of central tendency.

Measures of dispersion Measures that give the spread of a distri-
bution. The range, variance, and standard deviation are three such
measures.

Measures of position Measures that determine the position of a
single value in relation to other values in a data set. Quartiles, per-
centiles, and percentile rank are examples of measures of position.

Median The value of the middle term in a ranked data set. The
median divides a ranked data set into two equal parts.

Mode The value (or values) that occurs with highest frequency in
a data set.

Multimodal distribution A distribution that has more than two
modes.

Parameter A summary measure calculated for population data.

Percentile rank The percentile rank of a value gives the percent-
age of values in the data set that are smaller than this value.

Percentiles Ninety-nine values that divide a ranked data set into
100 equal parts.

Quartiles Three summary measures that divide a ranked data set
into four equal parts.

Range A measure of spread obtained by taking the difference be-
tween the largest and the smallest values in a data set.

Second quartile Middle or second of the three quartiles that divide
a ranked data set into four equal parts. About 50% of the values in
the data set are smaller and about 50% are larger than the second
quartile. The second quartile is the same as the median.

Standard deviation A measure of spread that is given by the pos-
itive square root of the variance.

Statistic A summary measure calculated for sample data.

Third quartile Third of the three quartiles that divide a ranked
data set into four equal parts. About 75% of the values in a data set
are smaller than the value of the third quartile and about 25% are
larger. It is the median of the values that are greater than the median
of the whole data set.

Trimmed mean The k% trimmed mean is obtained by dropping
k% of the smallest values and k% of the largest values from the given
data and then calculating the mean of the remaining (100 − 2k)%
of the values.


Unimodal distribution A distribution that has only one mode.

Upper inner fence The value in a data set that is 1.5 × IQR above
the third quartile.

Upper outer fence The value in a data set that is 3.0 × IQR above
the third quartile.

Variance A measure of spread.

Weighted mean Mean of a data set whose values are assigned
different weights before the mean is calculated.
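
The definitions of the trimmed mean, the weighted mean, and the inner and outer fences translate directly into a few lines of code. The following Python sketch is illustrative only: the function names and the sample values are hypothetical and do not come from the text, and the trimmed mean assumes that k% of the number of values is rounded down to a whole count.

```python
def trimmed_mean(values, k):
    """k% trimmed mean: drop k% of the smallest and k% of the largest values."""
    ranked = sorted(values)
    drop = int(len(ranked) * k / 100)  # assumption: round the count down
    kept = ranked[drop:len(ranked) - drop]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def fences(q1, q3):
    """Inner and outer fences computed from the first and third quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }


# Hypothetical data, chosen only to show the effect of one extreme value.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 100]
print(sum(data) / len(data))              # simple mean, pulled up by 100
print(trimmed_mean(data, 10))             # mean after dropping 2 and 100
print(weighted_mean([2.0, 3.0], [1, 3]))  # the value 3.0 counts three times
print(fences(10, 20))                     # hypothetical quartiles Q1 = 10, Q3 = 20
```

With these made-up values the trimmed mean sits well below the simple mean, which is the reason for trimming when a data set contains an extreme value.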

Supplementary Exercises

3.109 Each year the faculty at Metro Business College chooses 10 members from the current graduating
class that they feel are most likely to succeed. The data below give the current annual incomes (in
thousands of dollars) of the 10 members of the class of 2000 who were voted most likely to succeed.

59 68 84 78 107 382 56 74 97 60

a. Calculate the mean and median.

b. Does this data set contain any outlier(s)? If yes, drop the outlier(s) and recalculate the mean and
median. Which of these measures changes by a greater amount when you drop the outlier(s)?

c. Is the mean or the median a better summary measure for these data? Explain.

3.110 The Belmont Stakes is the final race in the annual Triple Crown of thoroughbred racing. The race
is 1.5 miles in length, and the record for the fastest time (2 minutes, 24.00 seconds) was set by Secretariat in 1973.
The following data, for the 1999 to 2008 Belmont Stakes winners, show how much faster (in seconds)
Secretariat's record time is than each of these winning times.

3.80 7.20 2.80 5.71 4.26 3.50 4.75 3.81 4.74 5.65

a. Calculate the mean and median. Do these data have a mode? Why or why not?

b. Compute the variance, standard deviation, and range for these data.

3.111 The following table gives the total points scored by each of the top 10 scorers during the 2007-08
regular season of the National Basketball Association. (Source: www.NBA.com.)

Player               Points Scored
Kobe Bryant          2323
LeBron James         2250
Allen Iverson        2164
Amare Stoudemire     1989
Carmelo Anthony      1978
Richard Jefferson    1857
Dirk Nowitzki        1817
Baron Davis          1791
Jason Richardson     1788
Joe Johnson          1779

a. Calculate the mean and median. Do these data have a mode? Why or why not?

b. Compute the variance, standard deviation, and range for these data.

3.112 The following data give the numbers of driving citations received by 12 drivers.

4 8 3 1 17 4 1 48 1 37 9

a. Find the mean, median, and mode for these data.

b. Calculate the range, variance, and standard deviation.

c. Are the values of the summary measures in parts a and b population parameters or sample
statistics?

3.113 The following table gives the distribution of the amounts of rainfall (in inches) for July 2009 for
50 cities.


Rainfall

Number of Cities

0 to less than 2

2 to less than 4

1 A

4 to less than 6

6 to less than 8

8 to less than 10

10 to less than 12

Find the mean, variance, and standard deviation. Are the values of these summary measures population
parameters or sample statistics?

3.114 The following table gives the frequency distribution of the times (in minutes) that 50 commuter students
at a large university spent looking for parking spaces on the first day of classes in the Fall semester of 2009.

Time

Number of Students

0 to less than 4

4 to less than 8

8 to less than 12

12 to less than 16

16 to less than 20

20 to less than 24

Find the mean, variance, and standard deviation. Are the values of these summary measures population
parameters or sample statistics?

3.115 The mean time taken to learn the basics of a word processor by all students is 200 minutes with a
standard deviation of 20 minutes.

a. Using Chebyshev's theorem, find at least what percentage of students will learn the basics of
this word processor in

i. 160 to 240 minutes ii. 140 to 260 minutes

*b. Using Chebyshev's theorem, find the interval that contains the time taken by at least 75% of
all students to learn this word processor.

3.116 According to an estimate, Americans were expected to spend an average of 1669 hours watching
television in 2004. Assume that the average time spent watching television by Americans in 2009 has a
distribution that is skewed to the right with a mean of 1750 hours and a standard deviation of 450 hours.

a. Using Chebyshev's theorem, find at least what percentage of Americans watched television in
2009 for

i. 850 to 2650 hours ii. 400 to 3100 hours

*b. Using Chebyshev's theorem, find the interval that will contain the television viewing times of
at least 84% of all Americans.

3.117 Refer to Exercise 3.115. Suppose the times taken to learn the basics of this word processor by all
students have a bell-shaped distribution with a mean of 200 minutes and a standard deviation of 20 minutes.

a. Using the empirical rule, find the percentage of students who will learn the basics of this
word processor in

i. 180 to 220 minutes ii. 160 to 240 minutes

*b. Using the empirical rule, find the interval that contains the time taken by 99.7% of all
students to learn this word processor.

3.118 Assume that the annual earnings of all employees with CPA certification and 6 years of experience
and working for large firms have a bell-shaped distribution with a mean of $134,000 and a standard
deviation of $12,000.

a. Using the empirical rule, find the percentage of all such employees whose annual earnings are
between

i. $98,000 and $170,000 ii. $110,000 and $158,000
*b. Using the empirical rule, find the interval that contains the annual earnings of 68% of all such
employees.


3.119 Refer to the data of Exercise 3.109 on the current annual incomes (in thousands of dollars) of the
10 members of the class of 2000 of the Metro Business College who were voted most likely to succeed.

59 68 84 78 107 382 56 74 97 60

a. Determine the values of the three quartiles and the interquartile range. Where does the value of
74 fall in relation to these quartiles?

b. Calculate the (approximate) value of the 70th percentile. Give a brief interpretation of this
percentile.

c. Find the percentile rank of 97. Give a brief interpretation of this percentile rank.

3.120 Refer to the data given in Exercise 3.111 on the total points scored by the top 10 NBA scorers
during the 2007-08 NBA regular season.

a. Calculate the values of the three quartiles and the interquartile range. Where does the value
1978 fall in relation to these quartiles?

b. Find the approximate value of the 70th percentile. Give a brief interpretation of this percentile.

c. Calculate the percentile rank of 1788. Give a brief interpretation of this percentile rank.

3.121 A student washes her clothes at a laundromat once a week. The data below give the time (in
minutes) she spent in the laundromat for each of 15 randomly selected weeks. Here, time spent in the
laundromat includes the time spent waiting for a machine to become available.

75 62 84 73 107 81 93 72
135 77 85 67 90 83 112

Prepare a box-and-whisker plot. Is the data set skewed in any direction? If yes, is it skewed to the right
or to the left? Does this data set contain any outliers?

3.122 The following data give the lengths of time (in weeks) taken to find a full-time job by 18 computer
science majors who graduated in 2008 from a small college.

30 43 32 21 65 8 4 18 16
38 9 44 33 23 24 81 42 55

Make a box-and-whisker plot. Comment on the skewness of this data set. Does this data set contain any
outliers?

Advanced Exercises

3.123 Melissa's grade in her math class is determined by three 100-point tests and a 200-point final exam.
To determine the grade for a student in this class, the instructor will add the four scores together and
divide this sum by 5 to obtain a percentage. This percentage must be at least 80 for a grade of B. If Melissa's
three test scores are 75, 69, and 87, what is the minimum score she needs on the final exam to obtain a
B grade?

3.124 Jeffrey is serving on a six-person jury for a personal-injury lawsuit. All six jurors want to award
damages to the plaintiff but cannot agree on the amount of the award. The jurors have decided that each
of them will suggest an amount that he or she thinks should be awarded; then they will use the mean of
these six numbers as the award to recommend to the plaintiff.

a. Jeffrey thinks the plaintiff should receive $20,000, but he thinks the mean of the other five
jurors' recommendations will be about $12,000. He decides to suggest an inflated amount so
that the mean for all six jurors is $20,000. What amount would Jeffrey have to suggest?

b. How might this jury revise its procedure to prevent a juror like Jeffrey from having an undue
influence on the amount of damages to be awarded to the plaintiff?

3.125 The heights of five starting players on a basketball team have a mean of 76 inches, a median of
78 inches, and a range of 11 inches.

a. If the tallest of these five players is replaced by a substitute who is 2 inches taller, find the new
mean, median, and range.

b. If the tallest player is replaced by a substitute who is 4 inches shorter, which of the new values
(mean, median, range) could you determine, and what would their new values be?

3.126 On a 300-mile auto trip, Lisa averaged 52 mph for the first 100 miles, 65 mph for the second 100
miles, and 58 mph for the last 100 miles.

a. How long did the 300-mile trip take?

b. Could you find Lisa's average speed for the 300-mile trip by calculating (52 + 65 + 58)/3? If
not, find the correct average speed for the trip.


3.127 A small country bought oil from three different sources in one week, as shown in the following table.

Source

Barrels Purchased

Price per Barrel ($)

Mexico

1000

Kuwait

200

Spot Market

100

Find the mean price per barrel for all 1300 barrels of oil purchased in that week.

3.128 During the 2008 winter season, a homeowner received four deliveries of heating oil, as shown in
the following table.

Gallons Purchased    Price per Gallon ($)
209                  2.22
182                  2.34
157                  2.41
149                  2.43

The homeowner claimed that the mean price he paid for oil during the season was (2.22 + 2.34 + 2.41 +
2.43)/4 = $2.35 per gallon. Do you agree with this claim? If not, explain why this method of calculating
the mean is not appropriate in this case and find the correct value of the mean price.

3.129 In the Olympic Games, when events require a subjective judgment of an athlete's performance, the
highest and lowest of the judges' scores may be dropped. Consider a gymnast whose performance is judged
by seven judges and the highest and the lowest of the seven scores are dropped.

a. Gymnast A's scores in this event are 9.4, 9.7, 9.5, 9.5, 9.4, 9.6, and 9.5. Find this gymnast's
mean score after dropping the highest and the lowest scores.

b. The answer to part a is an example of (approximately) what percentage of trimmed mean?

c. Write another set of scores for a gymnast B so that gymnast A has a higher mean score than
gymnast B based on the trimmed mean, but gymnast B would win if all seven scores were
counted. Do not use any scores lower than 9.0.

3.130 A survey of young people's shopping habits in a small city during the summer months of 2009
showed the following: Shoppers aged 12 to 14 years took an average of 8 shopping trips per month and
spent an average of $14 per trip. Shoppers aged 15 to 17 years took an average of 11 trips per month and
spent an average of $18 per trip. Assume that this city has 1100 shoppers aged 12 to 14 years and 900
shoppers aged 15 to 17 years.

a. Find the total amount spent per month by all these 2000 shoppers in both age groups.

b. Find the mean number of shopping trips per person per month for these 2000 shoppers.

c. Find the mean amount spent per person per month by shoppers aged 12 to 17 years in this city.

3.131 The following table shows the total population and the number of deaths (in thousands) due to heart
attack for two age groups (in years) in Countries A and B for 2009.

                              Age 30 and Under          Age 31 and Over
                              Country A   Country B     Country A   Country B
Population                    40,000      25,000        20,000      35,000
Deaths due to heart attack    1000        500           2000        3000

a. Calculate the death rate due to heart attack per 1000 population for the 30 years and under age
group for each of the two countries. Which country has the lower death rate in this age group?

b. Calculate the death rates due to heart attack for the two countries for the 31 years and over age
group. Which country has the lower death rate in this age group?

c. Calculate the death rate due to heart attack for the entire population of Country A; then do the
same for Country B. Which country has the lower overall death rate?

d. How can the country with lower death rate in both age groups have the higher overall death
rate? (This phenomenon is known as Simpson's paradox.)


3.132 In a study of distances traveled to a college by commuting students, data from 100 commuters
yielded a mean of 8.73 miles. After the mean was calculated, data came in late from three students, with
respective distances of 11.5, 7.6, and 10.0 miles. Calculate the mean distance for all 103 students.

3.133 The test scores for a large statistics class have an unknown distribution with a mean of 70 and a
standard deviation of 10.

a. Find k so that at least 50% of the scores are within k standard deviations of the mean.

b. Find k so that at most 10% of the scores are more than k standard deviations above the mean.

3.134 The test scores for a very large statistics class have a bell-shaped distribution with a mean of 70 points.

a. If 16% of all students in the class scored above 85, what is the standard deviation of the scores?

b. If 95% of the scores are between 60 and 80, what is the standard deviation?

3.135 How much does the typical American family spend to go away on vacation each year? Twenty-five
randomly selected households reported the following vacation expenditures (rounded to the nearest
hundred dollars) during the past year:

2500

500

800

100

200

2200

200

1000

900

321,500

400

500

100

8200

900

1700

1100

600

3400

a. Using both graphical and numerical methods, organize and interpret these data.

b. What measure of central tendency best answers the original question?

3.136 Actuaries at an insurance company must determine a premium for a new type of insurance. A
random sample of 40 potential purchasers of this type of insurance were found to have suffered the
following values of losses (in dollars) during the past year. These losses would have been covered by
the insurance if it were available.

100

470

14,589

212

1127

421

135

420

250

309

177

295

501

143

167

398

141

3709

122

a. Find the mean, median, and mode of these 40 losses.

b. Which of the mean, median, or mode is largest?

c. Draw a box-and-whisker plot for these data, and describe the skewness, if any.

d. Which measure of central tendency should the actuaries use to determine the premium for this
insurance?

3.137 A local golf club has men's and women's summer leagues. The following data give the scores for
a round of 18 holes of golf for 17 men and 15 women randomly selected from their respective leagues.

Men

112

102

Women

101

100

117

107

103

100

a. Make a box-and-whisker plot for each of the data sets and use them to discuss the similarities
and differences between the scores of the men and women golfers.

b. Compute the various descriptive measures you have learned for each sample. How do they
compare?

3.138 Answer the following questions.

a. The total weight of all pieces of luggage loaded onto an airplane is 12,372 pounds, which works
out to be an average of 51.55 pounds per piece. How many pieces of luggage are on the plane?

b. A group of seven friends, having just gotten back a chemistry exam, discuss their scores. Six of
the students reveal that they received grades of 81, 75, 93, 88, 82, and 85, respectively, but the
seventh student is reluctant to say what grade she received. After some calculation she
announces that the group averaged 81 on the exam. What is her score?


3.139 Suppose that there are 150 freshmen engineering majors at a college and each of them will take the
same five courses next semester. Four of these courses will be taught in small sections of 25 students each,
whereas the fifth course will be taught in one section containing all 150 freshmen. To accommodate all
150 students, there must be six sections of each of the four courses taught in 25-student sections. Thus,
there are 24 classes of 25 students each and one class of 150 students.

a. Find the mean size of these 25 classes.

b. Find the mean class size from a student's point of view, noting that each student has five
classes containing 25, 25, 25, 25, and 150 students, respectively.

Are the means in parts a and b equal? If not, why not?

3.140 The following data give the weights (in pounds) of a random sample of 44 college students. (Here
F and M indicate female and male, respectively.)

123 F   195 M   138 M   115 F
179 M   119 F   148 F   147 F
180 M   146 F   179 M   189 M
175 M   108 F   193 M   114 F
179 M   147 M   108 F   128 F
164 F   174 M   128 F   159 M
193 M   204 M   125 F   133 F
115 F   168 M   123 F   183 M
116 F   182 M   174 M   102 F
123 F    99 F   161 M   162 M
155 F   202 M   110 F   132 M

Compute the mean, median, and standard deviation for the weights of all students, of men only, and of
women only. Of the mean and median, which is the more informative measure of central tendency? Write
a brief note comparing the three measures for all students, men only, and women only.

3.141 The distribution of the lengths of fish in a certain lake is not known, but it is definitely not bell
shaped. It is estimated that the mean length is 6 inches with a standard deviation of 2 inches.

a. At least what proportion of fish in the lake are between 3 inches and 9 inches long?

b. What is the smallest interval that will contain the lengths of at least 84% of the fish?

c. Find an interval so that fewer than 36% of the fish have lengths outside this interval.

3.142 The following stem-and-leaf diagram gives the distances (in thousands of miles) driven during the
past year by a sample of drivers in a city.

0 | 3 6 9
1 | 2 8 5 1 5
2 | 5 1 6
3 | 8
4 | 1
5 |
6 | 2

a. Compute the sample mean, median, and mode for the data on distances driven.

b. Compute the range, variance, and standard deviation for these data.

c. Compute the first and third quartiles.

d. Compute the interquartile range. Describe what properties the interquartile range has. When
would it be preferable to using the standard deviation when measuring variation?

3.143 Refer to the data in Problem 3.140. Two individuals, one from Canada and one from England, are
interested in your analysis of these data but they need your results in different units. The Canadian
individual wants the results in grams (1 pound = 453.59 grams), while the English individual wants the
results in stone (1 stone = 14 pounds).

a. Convert the data on weights from pounds to grams, and then recalculate the mean, median, and
standard deviation of weight for males and females separately. Repeat the procedure, changing
the unit from pounds to stones.

b. Convert your answers from Problem 3.140 to grams and stone. What do you notice about these
answers and your answers from part a?

c. What happens to the values of the mean, median, and standard deviation when you convert
from a larger unit to a smaller unit (e.g., from pounds to grams)? Does the same thing happen
if you convert from a smaller unit (e.g., pounds) to a larger unit (e.g., stone)?


Figure 3.15 Stacked dotplot of weights in stone and pounds.

d. Figure 3.15 gives a stacked dotplot of these weights in pounds and stone. Which of these two
distributions has more variability? Use your results from parts a to c to explain why this is
the case.

e. Now consider the weights in pounds and grams. Make a stacked dotplot for these data and
answer part d.

3.144 Although the standard workweek is 40 hours a week, many people work a lot more than 40 hours
a week. The following data give the numbers of hours worked last week by 50 people.

40.5 41.3 41.4 41.5 42.0 42.2 42.4 42.4 42.6 43.3
43.7 43.9 45.0 45.0 45.2 45.8 45.9 46.2 47.2 47.5
47.8 48.2 48.3 48.8 49.0 49.2 49.9 50.1 50.6 50.6
50.8 51.5 51.5 52.3 52.3 52.6 52.7 52.7 53.4 53.9
54.4 54.8 55.0 55.4 55.4 55.4 56.2 56.3 57.8 58.7

a. The sample mean and sample standard deviation for this data set are 49.012 and 5.080,
respectively. Using Chebyshev's theorem, calculate the intervals that contain at least 75%, 88.89%,
and 93.75% of the data.

b. Determine the actual percentages of the given data values that fall in each of the intervals that
you calculated in part a. Also calculate the percentage of the data values that fall within one
standard deviation of the mean.

c. Do you think the lower endpoints provided by Chebyshev's theorem in part a are useful for this
problem? Explain your answer.

d. Suppose that the individual with the first number (54.4) in the fifth row of the data is a
workaholic who actually worked 84.4 hours last week and not 54.4 hours. With this change, now
\(\bar{x} = 49.61\) and \(s = 7.10\). Recalculate the intervals for part a and the actual percentages for part b.
Did your percentages change a lot or a little?

e. How many standard deviations above the mean would you have to go to capture all 50 data
values? What is the lower bound for the percentage of the data that should fall in the interval,
according to Chebyshev's theorem?

3.145 Refer to the women's golf scores in Exercise 3.137. It turns out that 117 was mistakenly entered.
Although this person still had the highest score among the 15 women, her score was not a mild or
extreme outlier according to the box-and-whisker plot, nor was she tied for the highest score. What are the
possible scores that she could have shot?

3.146 Refer to Problem 5 of the Self-Review Test in Chapter 1, which featured the 10 biggest Nasdaq
losers of October 2008. Let x represent the October 2008 return. Calculate \(\sum x/10\). What, if anything, does
this number represent? Explain your answer.


APPENDIX 3.1

A3.1.1 BASIC FORMULAS FOR THE VARIANCE AND STANDARD
DEVIATION FOR UNGROUPED DATA

Example 3-25 below illustrates how to use the basic formulas to calculate the variance and standard
deviation for ungrouped data. From Section 3.2.2, the basic formulas for variance for ungrouped data are

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

where \(\sigma^2\) is the population variance and \(s^2\) is the sample variance.

In either case, the standard deviation is obtained by taking the square root of the variance.
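
As a check on how these basic formulas relate to the short-cut formulas, the following sketch expands the sum of squared deviations for a sample. It assumes only the definition \(\bar{x} = \sum x / n\); the population case works the same way with \(\mu\) and \(N\).

```latex
\begin{align*}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - \frac{2\left(\sum x\right)^2}{n} + \frac{\left(\sum x\right)^2}{n} \\
  &= \sum x^2 - \frac{\left(\sum x\right)^2}{n},
\qquad\text{so}\qquad
s^2 = \frac{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}{n - 1}.
\end{align*}
```

Both forms therefore give the same variance; the basic formula simply makes the deviations from the mean explicit.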

EXAMPLE 3-25 Refer to Example 3-12, where we used the short-cut formulas to compute the
variance and standard deviation for the data on the 2008 market values of five international companies.
Calculate the variance and standard deviation for those data using the basic formula.

Solution Let x denote the 2008 market values (in billions of dollars) of these companies. Table 3.14
shows all the required calculations to find the variance and standard deviation.

Table 3.14 Calculating the variance and standard deviation for ungrouped data using basic formulas.

\(x\)             \(x - \bar{x}\)              \((x - \bar{x})^2\)
75                75 - 132.4 = -57.4           3294.76
107               107 - 132.4 = -25.4          645.16
271               271 - 132.4 = 138.6          19,209.96
138               138 - 132.4 = 5.6            31.36
71                71 - 132.4 = -61.4           3769.96
\(\sum x\) = 662                               \(\sum (x - \bar{x})^2\) = 26,951.20

The following steps are performed to compute the variance and standard deviation.
Step 1. Find the mean as follows:

\[ \bar{x} = \frac{\sum x}{n} = \frac{662}{5} = 132.4 \]

Step 2. Calculate \(x - \bar{x}\), the deviation of each value of \(x\) from the mean. The results are shown in the
second column of Table 3.14.

Step 3. Square each of the deviations of \(x\) from \(\bar{x}\); that is, calculate each of the \((x - \bar{x})^2\) values. These
values are called the squared deviations, and they are recorded in the third column.

Step 4. Add all the squared deviations to obtain \(\sum (x - \bar{x})^2\); that is, sum all the values given in the third
column of Table 3.14. This gives

\[ \sum (x - \bar{x})^2 = 26{,}951.20 \]

Step 5. Obtain the sample variance by dividing the sum of the squared deviations by \(n - 1\). Thus

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{26{,}951.20}{5 - 1} = 6737.80 \]

Step 6. Obtain the sample standard deviation by taking the positive square root of the variance. Hence,

\[ s = \sqrt{6737.80} = 82.0841 \approx \$82.08 \text{ billion} \]
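
Steps 1 to 6 can also be traced in code. The following Python sketch is one possible way to do so, not part of the original example; the variable names are illustrative, and the data are the five market values that appear in Table 3.14. The last lines cross-check the hand calculation against Python's standard `statistics` module, which also divides by n − 1 for sample data.

```python
import math
import statistics

# 2008 market values (in billions of dollars) from Table 3.14
values = [75, 107, 271, 138, 71]

n = len(values)
mean = sum(values) / n                              # Step 1: sample mean
deviations = [x - mean for x in values]             # Step 2: x - mean
squared_deviations = [d ** 2 for d in deviations]   # Step 3: (x - mean)^2
sum_squared = sum(squared_deviations)               # Step 4: sum of squares
variance = sum_squared / (n - 1)                    # Step 5: sample variance
std_dev = math.sqrt(variance)                       # Step 6: sample std. deviation

print(f"mean = {mean:.1f}")
print(f"sum of squared deviations = {sum_squared:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")

# Cross-check with the standard library (sample formulas, divisor n - 1)
assert math.isclose(variance, statistics.variance(values))
assert math.isclose(std_dev, statistics.stdev(values))
```

Replacing `n - 1` with `n` in Step 5 would give the population variance instead, matching the first of the two basic formulas.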


A3.1.2 BASIC FORMULAS FOR THE VARIANCE AND
STANDARD DEVIATION FOR GROUPED DATA

Example 3-26 demonstrates how to use the basic formulas to calculate the variance and standard
deviation for grouped data. The basic formulas for these calculations are

\[ \sigma^2 = \frac{\sum f(m - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} \]

where \(\sigma^2\) is the population variance, \(s^2\) is the sample variance, \(m\) is the midpoint of a class, and \(f\) is the
frequency of a class.

In either case, the standard deviation is obtained by taking the square root of the variance.

EXAMPLE 3-26 In Example 3-17, we used the short-cut formula to compute the variance and standard
deviation for the data on the numbers of orders received each day during the past 50 days at the office of
a mail-order company. Calculate the variance and standard deviation for those data using the basic formula.

Solution All the required calculations to find the variance and standard deviation appear in Table 3.15.
Table 3.15 Calculating the variance and standard deviation for grouped data using basic formulas.

Number of Orders   \(f\)   \(m\)   \(mf\)   \(m - \bar{x}\)   \((m - \bar{x})^2\)   \(f(m - \bar{x})^2\)
10-12              4       11      44       -5.64             31.8096               127.2384
13-15              12      14      168      -2.64             6.9696                83.6352
16-18              20      17      340      .36               .1296                 2.5920
19-21              14      20      280      3.36              11.2896               158.0544
                   \(n\) = 50      \(\sum mf\) = 832                                \(\sum f(m - \bar{x})^2\) = 371.5200

The following steps are performed to compute the variance and standard deviation using the basic formula.

Step 1. Find the midpoint of each class. Multiply the corresponding values of \(m\) and \(f\). Find \(\sum mf\). From
Table 3.15, \(\sum mf = 832\).

Step 2. Find the mean as follows:

\[ \bar{x} = \frac{\sum mf}{n} = \frac{832}{50} = 16.64 \]

Step 3. Calculate \(m - \bar{x}\), the deviation of each value of \(m\) from the mean. These calculations are done
in the fifth column of Table 3.15.

Step 4. Square each of the deviations \(m - \bar{x}\); that is, calculate each of the \((m - \bar{x})^2\) values. These are
called squared deviations, and they are recorded in the sixth column.

Step 5. Multiply the squared deviations by the corresponding frequencies (see the seventh column of
Table 3.15). Adding the values of the seventh column, we obtain

\[ \sum f(m - \bar{x})^2 = 371.5200 \]

Step 6. Obtain the sample variance by dividing \(\sum f(m - \bar{x})^2\) by \(n - 1\). Thus,

\[ s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} = \frac{371.5200}{50 - 1} = 7.5820 \]

Step 7. Obtain the standard deviation by taking the positive square root of the variance.

\[ s = \sqrt{s^2} = \sqrt{7.5820} = 2.75 \text{ orders} \]
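
The seven steps for grouped data can be sketched in Python in the same way. This is an illustration under stated assumptions rather than the book's own procedure: the variable names are made up, and the class frequencies used below (4, 12, 20, 14) are inferred from the \(mf\) and \(f(m - \bar{x})^2\) values in Table 3.15, so treat them as an assumption.

```python
import math

# (lower limit, upper limit, frequency) for each class of Table 3.15.
# Frequencies are inferred from the mf column (assumption, see lead-in).
classes = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]        # Step 1: midpoints m
n = sum(freqs)                                              # total observations
sum_mf = sum(m * f for m, f in zip(midpoints, freqs))       # Step 1: sum of m*f
mean = sum_mf / n                                           # Step 2: mean
deviations = [m - mean for m in midpoints]                  # Step 3: m - mean
squared = [d ** 2 for d in deviations]                      # Step 4: (m - mean)^2
sum_f_sq = sum(f * s for f, s in zip(freqs, squared))       # Step 5: sum f(m - mean)^2
variance = sum_f_sq / (n - 1)                               # Step 6: sample variance
std_dev = math.sqrt(variance)                               # Step 7: sample std. deviation

print(f"n = {n}, sum of mf = {sum_mf:.0f}, mean = {mean:.2f}")
print(f"sum of f(m - mean)^2 = {sum_f_sq:.4f}")
print(f"variance = {variance:.4f}, standard deviation = {std_dev:.2f}")
```

Because every observation in a class is represented by the class midpoint, the results are approximations to the variance and standard deviation of the underlying ungrouped data.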

Self-Review Test

1. The value of the middle term in a ranked data set is called the
a. mean b. median c. mode

2. Which of the following summary measures is/are influenced by extreme values?
a. mean b. median c. mode d. range


3. Which of the following summary measures can be calculated for qualitative data?
a. mean b. median c. mode

4. Which of the following can have more than one value?
a. mean b. median c. mode

5. Which of the following is obtained by taking the difference between the largest and the smallest
values of a data set?

a. variance b. range c. mean

6. Which of the following is the mean of the squared deviations of x values from the mean?
a. standard deviation b. population variance c. sample variance

7. The values of the variance and standard deviation are

a. never negative b. always positive c. never zero

8. A summary measure calculated for the population data is called

a. a population parameter b. a sample statistic c. an outlier

9. A summary measure calculated for the sample data is called a

a. population parameter b. sample statistic c. box-and-whisker plot

10. Chebyshev's theorem can be applied to

a. any distribution b. bell-shaped distributions only c. skewed distributions only

11. The empirical rule can be applied to

a. any distribution b. bell-shaped distributions only c. skewed distributions only

12. The first quartile is a value in a ranked data set such that about

a. 75% of the values are smaller and about 25% are larger than this value

b. 50% of the values are smaller and about 50% are larger than this value

c. 25% of the values are smaller and about 75% are larger than this value

13. The third quartile is a value in a ranked data set such that about

a. 75% of the values are smaller and about 25% are larger than this value

b. 50% of the values are smaller and about 50% are larger than this value

c. 25% of the values are smaller and about 75% are larger than this value

14. The 75th percentile is a value in a ranked data set such that about

a. 75% of the values are smaller and about 25% are larger than this value

b. 25% of the values are smaller and about 75% are larger than this value

15. The following data give the numbers of times 10 persons used their credit cards during the past 3
months.

9 6 28 14 2 18 7 3 16 6
Calculate the mean, median, mode, range, variance, and standard deviation.

16. The mean, as a measure of central tendency, has the disadvantage of being influenced by extreme
values. Illustrate this point with an example.

17. The range, as a measure of spread, has the disadvantage of being influenced by extreme values.
Illustrate this point with an example.

18. When is the value of the standard deviation for a data set zero? Give one example of such a data set.
Calculate the standard deviation for that data set to show that it is zero.

19. The following table gives the frequency distribution of the numbers of computers sold during the past
25 weeks at a computer store.

Computers Sold

Frequency

4 to 9

10 to 15

16 to 21

22 to 27

28 to 33

a. What does the frequency column in the table represent?

b. Calculate the mean, variance, and standard deviation.


20. The cars owned by all people living in a city are, on average, 7.3 years old with a standard deviation
of 2.2 years.

a. Using Chebyshev's theorem, find at least what percentage of the cars in this city are
i. 1.8 to 12.8 years old ii. .7 to 13.9 years old

b. Using Chebyshev's theorem, find the interval that contains the ages of at least 75% of the cars
owned by all people in this city.

21. The ages of cars owned by all people living in a city have a bell-shaped distribution with a mean of
7.3 years and a standard deviation of 2.2 years.

a. Using the empirical rule, find the percentage of cars in this city that are
i. 5.1 to 9.5 years old ii. .7 to 13.9 years old

b. Using the empirical rule, find the interval that contains the ages of 95% of the cars owned by
all people in this city.

22. The following data give the number of times the metal detector was set off by passengers at a small
airport during 15 consecutive half-hour periods on February 1, 2009.

7 2 1 21 3 8 1
1 53 5 1 42 01 1 14

a. Calculate the three quartiles and the interquartile range. Where does the value of 4 lie in rela-
tion to these quartiles?

b. Find the (approximate) value of the 60th percentile. Give a brief interpretation of this value.

c. Calculate the percentile rank of 12. Give a brief interpretation of this value.

23. Make a box-and-whisker plot for the data on the number of times passengers set off the airport metal
detector given in Problem 22. Comment on the skewness of this data set.

*24. The mean weekly wages of a sample of 15 employees of a company are $1035. The mean weekly
wages of a sample of 20 employees of another company are $1090. Find the combined mean for these
35 employees.

*25. The mean GPA of five students is 3.21. The GPAs of four of these five students are, respectively,
3.85, 2.67, 3.45, and 2.91. Find the GPA of the fifth student.

*26. The following are the prices (in thousands of dollars) of 10 houses sold recently in a city:

479 366 238 207 287 349 293 2534 463 538

Calculate the 10% trimmed mean for this data set. Do you think the 10% trimmed mean is a better sum-
mary measure than the (simple) mean (i.e., the mean of all 10 values) for these data? Briefly explain why
or why not.

*27. Consider the following two data sets.

Data Set I: 8 16 20 35

Data Set II: 5 13 17 32

Note that each value of the second data set is obtained by subtracting 3 from the corresponding value of
the first data set.

a. Calculate the mean for each of these two data sets. Comment on the relationship between the
two means.

b. Calculate the standard deviation for each of these two data sets. Comment on the relationship
between the two standard deviations.

Mini-Projects

■ MINI-PROJECT 3-1

Refer to the data you collected for Mini-Project 1-1 of Chapter 1 and analyzed graphically in Mini-Project 2-1
of Chapter 2. Write a report summarizing those data. This report should include answers to at least the
following questions.

a. Calculate the summary measures (mean, standard deviation, five-number summary, interquartile
range) for the variable you graphed in Mini-Project 2-1. Do this for the entire data set, as well as
for the different groups formed by the categorical variable that you used to divide the data set in
Mini-Project 2-1.


b. Are the summary measures for the various groups similar to those for the entire data set? If not,
which ones differ and how do they differ? Make the same comparisons among the summary
measures for various groups. Do the groups have similar levels of variability? Explain how you
can determine this from the graphs that you created in Mini-Project 2-1.

c. Draw a box-and-whisker plot for the entire data set. Also draw side-by-side box-and-whisker plots
for the various groups. Are there any outliers? If so, are there any values that are outliers in any
of the groups but not in the entire data set? Does the plot show any skewness?

d. Discuss which measures for the center and spread would be more appropriate to use to describe
your data set. Also, discuss your reasons for using those measures.

■ MINI-PROJECT 3-2

You are employed as a statistician for a company that makes household products, which are sold by part-
time salespersons who work during their spare time. The company has four salespersons employed in a
small town. Let us denote these salespersons by A, B, C, and D. The sales records (in dollars) for the past
6 weeks for these four salespersons are shown in the following table.

Week       A       B       C       D

1       1774    2205    1330    1402
2       1808    1507    1295    1665
3       1890    2352    1502    1530
4       1932    1939    1104    1826
5       1855    2052    1189    1703
6       1726    1630    1441    1498

Your supervisor has asked you to prepare a brief report comparing the sales volumes and the consistency
of sales of these four salespersons. Use the mean sales for each salesperson to compare the sales volumes,
and then choose an appropriate statistical measure to compare the consistency of sales. Make the calcu-
lations and write a report.

DECIDE FOR YOURSELF
Deciding Where to Live

By the time you get to college, you must have heard it over and over
again: "A picture is worth a thousand words." Now we have pictures
and numbers discussed in Chapters 2 and 3, respectively. Why both?
Well, although each one of them acts as a summary of a data set, it
is a combination of the pictures and numbers that tells a big part of
the story without having to look at the entire data set. Suppose that
you ask a realtor for information on the prices of homes in two dif-
ferent but comparable midwestern suburbs. Let us call these Suburbs
A and B. The realtor provides you with the following information
that is obtained from a random sample of 40 houses in each suburb:

a. The average price of homes in each of the two suburbs

b. The five-number summary of prices of homes in each
neighborhood

c. The histogram of the distribution of home prices for each suburb

All the information provided by the realtor is given in the following
two tables and two histograms shown in Figures 3.16 and 3.17. Note
that the second table gives the minimum and maximum prices of
homes (in thousands of dollars) for each suburb along with the values
of $Q_1$, median, and $Q_3$ (in thousands of dollars).

Suburb      Average Price (in thousands of dollars)

A           221.9
B           220.03

            Minimum    Q1       Median   Q3       Maximum

Suburb A    151.0      175.5    188.0    199.5    587.0
Suburb B    187.0      210.0    222.5    228.0    250.0

Before you decide which suburb you should buy the house in,
answer the following questions:

1. Examine the summary statistics and graphs given here.

2. Explain how the information given here can help you to make a
decision about the suburb where you should look for a house to buy.

3. Explain how and why you might be misled by simply looking at the
average prices if you are looking to spend less money to buy a house.

4. Is there any information about the suburbs not given here that you
would like to obtain before making a decision about the suburb where
you should buy a house?


Figure 3.16 Histogram of prices of homes in Suburb A (horizontal axis: Prices of Homes).

Figure 3.17 Histogram of prices of homes in Suburb B (horizontal axis: Prices of Homes).

TECHNOLOGY INSTRUCTION

Numerical Descriptive Measures

TI-84

Screen 3.1

1-Var Stats
x̄=6.833333333
Σx=41
Σx²=377
Sx=4.400757511
σx=4.017323598
n=6

Screen 3.2

1-Var Stats
n=6
minX=2
Q1=3
Med=6
Q3=11
maxX=13

1. To calculate the sample statistics (e.g., mean, standard deviation, and five-number summary),
first enter your data into a list such as L1, then select STAT >CALC >1-Var
Stats, and press Enter. Access the name of your list by pressing 2nd >STAT and scrolling
through the list of names until you get to your list name. Press ENTER. You will
obtain the output shown in Screens 3.1 and 3.2.

Screen 3.1 shows, in this order, the sample mean, the sum of the data values, the
sum of the squared data values, the sample standard deviation, the value of the popula-
tion standard deviation (you will use this only when your data constitute a census
instead of a sample), and the number of data values (e.g., the sample or population
size). Pressing the downward arrow key will show the five-number summary, which is
shown in Screen 3.2.
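
As a cross-check on the calculator output, the same quantities can be sketched in Python. This is only an illustration under stated assumptions: the data values 2, 3, 5, 7, 11, 13 are not listed in the text (they are an assumed data set, chosen because they reproduce the sums, standard deviations, and five-number summary shown in Screens 3.1 and 3.2), and the helper name one_var_stats is hypothetical. The quartiles are taken as the medians of the lower and upper halves of the ranked data, which is the convention that gives Q1 = 3 and Q3 = 11 here.

```python
import statistics

# Assumed data set (not given in the text); chosen to match Screens 3.1 and 3.2.
data = [2, 3, 5, 7, 11, 13]


def one_var_stats(values):
    """Return the summary measures listed on the two calculator screens."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations below the median position
    upper = xs[half + (n % 2):]   # observations above the median position
    return {
        "mean": statistics.mean(xs),
        "sum_x": sum(xs),
        "sum_x2": sum(x * x for x in xs),
        "sample_sd": statistics.stdev(xs),       # divides by n - 1
        "population_sd": statistics.pstdev(xs),  # divides by n
        "n": n,
        "min": xs[0],
        "q1": statistics.median(lower),
        "median": statistics.median(xs),
        "q3": statistics.median(upper),
        "max": xs[-1],
    }


stats = one_var_stats(data)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The sample standard deviation is the one to report when the data are a sample; the population standard deviation applies only when the data constitute a census.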

2. Constructing a box-and-whisker plot is similar to constructing
a histogram. First enter your data into a list such as L1, then
select STAT PLOT and go into one of the three plots. Make
sure the plot is turned on. For the type, select the second row,
first column (this boxplot will display outliers, if there are
any). Enter the name of your list for XList. Select ZOOM>9
to display the plot as shown in Screen 3.3.

Screen 3.3

Minitab

1. To find the sample statistics (e.g., the mean, standard deviation, and five-number summary),
first enter the given data in a column such as C1, and then select Stat >Basic Statistics >
Display Descriptive Statistics. In the dialog box you obtain, enter the name of the column
where your data are stored in the Variables box as shown in Screen 3.4. Click the
Statistics button in this dialog box and choose the summary measures you want to
calculate in the new dialog box as shown in Screen 3.5. Click OK in both dialog boxes.
The output will appear in the Session window, which is shown in Screen 3.6.

Screen 3.4 Display Descriptive Statistics dialog box, with the column name data entered in the Variables box.

Screen 3.5 Descriptive Statistics - Statistics dialog box, listing the available summary measures (mean, standard deviation, variance, minimum, maximum, first quartile, median, third quartile, interquartile range, and others).

Session

Descriptive Statistics: data

          Total
Variable  Count  Mean  StDev  Minimum    Q1  Median     Q3  Maximum   IQR
data          6  6.83   4.40     2.00  2.75    6.00  11.50    13.00  8.75

Screen 3.6


2. To create a box-and-whisker plot, enter the given data in a column such as C1, select
Graph >Boxplot >Simple, and click OK. In the dialog box you obtain, enter the name
of the column with data in the Graph Variables box (see Screen 3.7) and click OK. The
boxplot shown in Screen 3.8 will appear.

Screen 3.7 Boxplot - One Y, Simple dialog box.

Screen 3.8 Boxplot of data.

Calculating Summary Statistics Using the Excel Analysis ToolPak Add-in

1. Click the Data tab. Click Data Analysis in the Analysis group. The Data Analysis menu
will open (see Screen 3.9).

2. Select Descriptive Statistics. Click OK. The Descriptive Statistics window will open (see
Screen 3.10). Click in the Input Range box. Select the range where your data are located.
(Note: the easiest way to do this is to highlight the data with your mouse.) Select Rows or
Columns to identify whether the data are grouped in rows or columns.


Screen 3.9 Data Analysis dialog box listing the analysis tools, including Descriptive Statistics.

Screen 3.10 Descriptive Statistics dialog box (Input Range: $A$3:$A$8; Grouped By: Columns; Summary statistics selected).

3. Select where you want Excel to place the output. You can select a specific range in the
current spreadsheet, a new spreadsheet within the current Excel workbook, or a new Excel
workbook.

4. Click Summary Statistics. Click OK (see Screen 3.11 for an example of the output).

5. The Analysis ToolPak does not calculate the first and third quartiles. To do this, go to an
empty cell in the spreadsheet. Then

a. Type =quartile(

b. Select the range of data and then type a comma

c. Type 1 for the first quartile or 3 for the third quartile

d. Type a right parenthesis, and then press Enter.

6. To find the kth percentile:

a. Type =percentile(

b. Select the range of data and then type a comma

c. Type the value of k

d. Type a right parenthesis, and then press Enter.
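
For readers working outside Excel, a rough Python counterpart of steps 5 and 6 is sketched below. It assumes that the spreadsheet functions interpolate between ranked values using the "inclusive" convention, which is what statistics.quantiles does when called with method="inclusive"; the data values and the helper name kth_percentile are assumptions made for illustration, not taken from the text. Because several quartile conventions exist, the results can differ from quartiles found as the medians of the lower and upper halves of the data.

```python
import statistics

# Assumed example values (not given in the text).
data = [2, 3, 5, 7, 11, 13]

# Quartiles: the three cut points that split the ranked data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")


def kth_percentile(values, k):
    """Return the kth percentile (k = 1, ..., 99) using inclusive interpolation."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[k - 1]


p60 = kth_percentile(data, 60)
print("Q1 =", q1, "Q3 =", q3, "60th percentile =", p60)
```

Here k is written as a whole number of percent (60 for the 60th percentile); a tool that expects a fraction between 0 and 1 would need .60 instead.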


Screen 3.11

Column 1

Mean                 6.833333
Standard Error       1.796602
Median               6
Mode                 #N/A
Standard Deviation   4.400758
Sample Variance      19.36667
Kurtosis             -1.54594
Skewness             0.46268
Range                11
Minimum              2
Maximum              13
Sum                  41
Count                6

TECHNOLOGY ASSIGNMENTS

TA3.1 Refer to the subsample taken in the Technology Assignment TA2.3 of Chapter 2 from the sam-
ple data on the time taken to run the Manchester Road Race. Find the mean, median, range, and standard
deviation for those data.

TA3.2 Refer to the data on phone charges given in Data Set I. From that data set select the 4th value
and then select every 10th value after that (i.e., select the 4th, 14th, 24th, 34th, ... values). Such a sample
taken from a population is called a systematic random sample. Find the mean, median, standard deviation,
first quartile, and third quartile for the phone charges for this subsample.

TA3.3 Refer to Data Set I on the prices of various products in different cities across the country. Select
a subsample of the prices of regular unleaded gas for 40 cities. Find the mean, median, and standard de-
viation for the data of this subsample.

TA3.4 Refer to Data of TA3.3. Make a box-and-whisker plot for those data.

TA3.5 Refer to Data Set I on the prices of various products in different cities across the country. Make
a box-and-whisker plot for the data on the monthly telephone charges.

TA3.6 Refer to the data on the numbers of computer keyboards assembled at the Twentieth Century Elec-
tronics Company for a sample of 25 days given in Exercise 3.104. Prepare a box-and-whisker plot for
those data.

TA3.7 Refer to Data Set VII on the stocks included in the Standard & Poor's 100 Index. Calculate the
mean, median, standard deviation, and interquartile range for the data on the highest prices for the stocks
in each of the market sectors. Compare the values of the various statistics for different sectors. Create a
stacked dotplot of the highest prices for various sectors with each sector's data as one set of data. Explain
how the results of your comparisons can be seen in the dotplot.

TA3.8 Refer to Data Set III on the National Basketball Association. Calculate the mean, median, stan-
dard deviation, and interquartile range for the players' ages separately for each of the three primary posi-
tions (center, forward, and guard). Is there a position that tends to have younger players, on average, than
the other positions? Is there a position that tends to have less variability in the ages of the players?

TA3.9 Calculate the five-number summaries, the values of the upper and lower inner fences, and the val-
ues of the upper and lower outer fences for the data referred to in TA3.8. Create side-by-side boxplots for
the data on three primary positions. Using these boxplots, compare the shapes of the age distributions for
the three positions. Are there any outliers? If so, classify the outliers as being mild or extreme.

Probability

Do you always come to a complete stop when approaching a stop sign? Although we all know
that we should, many of us rarely make a complete stop at stop signs. In a survey conducted by
Consumer Reports, 17% of men and 13% of women indicated that they often just slow down before
proceeding through a stop sign. On the other hand, 42% of men and 56% of women surveyed said
that they always stop at a stop sign. (See Case Study 4-1.)

We often make statements about probability. For example, a weather forecaster may predict that there
is an 80% chance of rain tomorrow. A health news reporter may state that a smoker has a much greater
chance of getting cancer than does a nonsmoker. A college student may ask an instructor about the
chances of passing a course or getting an A if he or she did not do well on the midterm examination.

Probability, which measures the likelihood that an event will occur, is an important part of statistics.
It is the basis of inferential statistics, which will be introduced in later chapters. In inferential statistics,
we make decisions under conditions of uncertainty. Probability theory is used to evaluate the uncer-
tainty involved in those decisions. For example, estimating next year's sales for a company is based
on many assumptions, some of which may happen to be true and others may not. Probability theory
will help us make decisions under such conditions of imperfect information and uncertainty.
Combining probability and probability distributions (which are discussed in Chapters 5 through 7) with
descriptive statistics will help us make decisions about populations based on information obtained
from samples. This chapter presents the basic concepts of probability and the rules for computing
probability.

4.1 Experiment, Outcomes,
and Sample Space

4.2 Calculating Probability

4.3 Counting Rule

4.4 Marginal and
Conditional Probabilities

Case Study 4-1 Rolling Stops

4.5 Mutually Exclusive
Events

4.6 Independent versus
Dependent Events

4.7 Complementary Events

4.8 Intersection of Events
and the Multiplication
Rule

Case Study 4-2 Baseball
Players have "Slumps"
and "Streaks"

4.9 Union of Events and the
Addition Rule


138 Chapter 4 Probability

4.1 Experiment, Outcomes, and Sample Space

Quality control inspector Jack Cook of Tennis Products Company picks up a tennis ball from
the production line to check whether it is good or defective. Cook's act of inspecting a tennis
ball is an example of a statistical experiment. The result of his inspection will be that the ball
is either "good" or "defective." Each of these two observations is called an outcome (also called
a basic or final outcome) of the experiment, and these outcomes taken together constitute the
sample space for this experiment.

Definition

Experiment, Outcomes, and Sample Space An experiment is a process that, when performed,
results in one and only one of many observations. These observations are called the outcomes of
the experiment. The collection of all outcomes for an experiment is called a sample space.

A sample space is denoted by S. The sample space for the example of inspecting a tennis
ball is written as

S = {good, defective}

The elements of a sample space are called sample points.

Table 4.1 lists some examples of experiments, their outcomes, and their sample spaces.

Table 4.1 Examples of Experiments, Outcomes, and Sample Spaces

Experiment          Outcomes            Sample Space

Toss a coin once    Head, Tail          S = {Head, Tail}
Roll a die once     1, 2, 3, 4, 5, 6    S = {1, 2, 3, 4, 5, 6}
Toss a coin twice   HH, HT, TH, TT      S = {HH, HT, TH, TT}
Play lottery        Win, Lose           S = {Win, Lose}
Take a test         Pass, Fail          S = {Pass, Fail}
Select a worker     Male, Female        S = {Male, Female}

The sample space for an experiment can also be illustrated by drawing either a Venn diagram
or a tree diagram. A Venn diagram is a picture (a closed geometric shape such as a rectangle,
a square, or a circle) that depicts all the possible outcomes for an experiment. In a tree diagram,
each outcome is represented by a branch of the tree. Venn and tree diagrams help us understand
probability concepts by presenting them visually. Examples 4-1 through 4-3 describe how to
draw these diagrams for statistical experiments.

■ EXAMPLE 4-1

Draw the Venn and tree diagrams for the experiment of tossing a coin once.

Solution This experiment has two possible outcomes: head and tail. Consequently, the sam-
ple space is given by

S = {H, T}, where H = Head and T = Tail

To draw a Venn diagram for this example, we draw a rectangle and mark two points
inside this rectangle that represent the two outcomes, Head and Tail. The rectangle is
labeled S because it represents the sample space (see Figure 4.1a). To draw a tree diagram,
we draw two branches starting at the same point, one representing the head and the second
representing the tail. The two final outcomes are listed at the ends of the branches (see
Figure 4.1b).

Drawing Venn and tree
diagrams: one toss of a coin.

Figure 4.1 (a) Venn diagram and (b) tree diagram
for one toss of a coin.

■ EXAMPLE 4-2

Draw the Venn and tree diagrams for the experiment of tossing a coin twice.

Solution This experiment can be split into two parts: the first toss and the second toss. Sup-
pose that the first time the coin is tossed, we obtain a head. Then, on the second toss, we can
still obtain a head or a tail. This gives us two outcomes: HH (head on both tosses) and HT
(head on the first toss and tail on the second toss). Now suppose that we observe a tail on the
first toss. Again, either a head or a tail can occur on the second toss, giving the remaining two
outcomes: TH (tail on the first toss and head on the second toss) and TT (tail on both tosses).
Thus, the sample space for two tosses of a coin is

S = {HH, HT, TH, TT}

The Venn and tree diagrams are given in Figure 4.2. Both of these diagrams show the sample
space for this experiment.
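
The branch-by-branch reasoning of a tree diagram can also be sketched in code. The following Python fragment is an illustration only; the function name sample_space is hypothetical. It pairs every outcome of the first toss with every outcome of the second toss, which is exactly what following each path of the tree does.

```python
from itertools import product


def sample_space(outcomes_per_trial, trials):
    """List every final outcome of repeating one trial `trials` times."""
    return ["".join(path) for path in product(outcomes_per_trial, repeat=trials)]


# H = head, T = tail; two tosses of a coin.
two_tosses = sample_space("HT", 2)
print(two_tosses)  # ['HH', 'HT', 'TH', 'TT']
```

Calling sample_space("HT", 3) lists the eight outcomes for three tosses in the same way.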

Drawing Venn and tree
diagrams: two tosses of a coin.

Figure 4.2 (a) Venn diagram and (b) tree diagram for two tosses
of a coin.

■ EXAMPLE 4-3

Suppose we randomly select two workers from a company and observe whether the worker
selected each time is a man or a woman. Write all the outcomes for this experiment. Draw the
Venn and tree diagrams for this experiment.

Solution Let us denote the selection of a man by M and that of a woman by W. We can
compare the selection of two workers to two tosses of a coin. Just as each toss of a coin can
result in one of two outcomes, head or tail, each selection from the workers of this company
can result in one of two outcomes, man or woman. As we can see from the Venn and tree di-
agrams of Figure 4.3, there are four final outcomes: MM, MW, WM, WW. Hence, the sample
space is written as

S = {MM, MW, WM, WW}

Drawing Venn and tree
diagrams: two selections.


Figure 4.3 (a) Venn diagram and (b) tree dia-
gram for selecting two workers.

4.1.1 Simple and Compound Events

An event consists of one or more of the outcomes of an experiment.

Definition

Event An event is a collection of one or more of the outcomes of an experiment.

An event may be a simple event or a compound event. A simple event is also called an el-
ementary event, and a compound event is also called a composite event.

Simple Event

Each of the final outcomes for an experiment is called a simple event. In other words, a simple
event includes one and only one outcome. Usually, simple events are denoted by $E_1$, $E_2$, $E_3$,
and so forth. However, we can denote them by any other letter, that is, by A, B, C, and so forth.

Definition

Simple Event An event that includes one and only one of the (final) outcomes for an experiment
is called a simple event and is usually denoted by $E_i$.

Example 4-4 describes simple events.

■ EXAMPLE 4-4

Reconsider Example 4-3 on selecting two workers from a company and observing whether
the worker selected each time is a man or a woman. Each of the final four outcomes (MM,
MW, WM, and WW) for this experiment is a simple event. These four events can be denoted
by $E_1$, $E_2$, $E_3$, and $E_4$, respectively. Thus,

$E_1$ = (MM), $E_2$ = (MW), $E_3$ = (WM), and $E_4$ = (WW) ■

Compound Event

A compound event consists of more than one outcome.

Definition

Compound Event A compound event is a collection of more than one outcome for an experiment.

Compound events are denoted by A, B, C, D, ... or by $A_1$, $A_2$, $A_3$, ..., $B_1$, $B_2$, $B_3$, ..., and so
forth. Examples 4-5 and 4-6 describe compound events.

Illustrating simple events.

■ EXAMPLE 4-5

Reconsider Example 4-3 on selecting two workers from a company and observing whether the
worker selected each time is a man or a woman. Let A be the event that at most one man is se-
lected. Event A will occur if either no man or one man is selected. Hence, the event A is given by

A = {MW, WM, WW}

Because event A contains more than one outcome, it is a compound event. The Venn diagram
in Figure 4.4 gives a graphic presentation of compound event A. ■

■ EXAMPLE 4-6

In a group of people, some are in favor of genetic engineering and others are against it. Two
persons are selected at random from this group and asked whether they are in favor of or
against genetic engineering. How many distinct outcomes are possible? Draw a Venn diagram
and a tree diagram for this experiment. List all the outcomes included in each of the follow-
ing events and state whether they are simple or compound events.

(a) Both persons are in favor of genetic engineering.

(b) At most one person is against genetic engineering.

(c) Exactly one person is in favor of genetic engineering.

Solution Let

F = a person is in favor of genetic engineering
A = a person is against genetic engineering

This experiment has the following four outcomes:

FF = both persons are in favor of genetic engineering
FA = the first person is in favor and the second is against
AF = the first person is against and the second is in favor
AA = both persons are against genetic engineering

The Venn and tree diagrams in Figure 4.5 show these four outcomes.

Illustrating a compound event: two selections.

Figure 4.4 Venn diagram for event A.

Illustrating simple and compound events: two selections.

Figure 4.5 Venn and tree diagrams.

(a) The event "both persons are in favor of genetic engineering" will occur if FF is ob-
tained. Thus,

Both persons are in favor of genetic engineering = {FF}
Because this event includes only one of the final four outcomes, it is a simple event.

(b) The event "at most one person is against genetic engineering" will occur if either none
or one of the persons selected is against genetic engineering. Consequently,

At most one person is against genetic engineering = {FF, FA, AF}

Because this event includes more than one outcome, it is a compound event.


(c) The event "exactly one person is in favor of genetic engineering" will occur if one of
the two persons selected is in favor and the other is against genetic engineering. Hence,
it includes the following two outcomes:

Exactly one person is in favor of genetic engineering = {FA, AF}

Because this event includes more than one outcome, it is a compound event. ■
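
The classification used in this example can be sketched in Python by treating each event as the set of outcomes that satisfy a condition. This is a minimal illustration; the helper names event and kind are hypothetical and not part of the text.

```python
from itertools import product

# Sample space for selecting two persons: F = in favor, A = against.
S = ["".join(pair) for pair in product("FA", repeat=2)]  # ['FF', 'FA', 'AF', 'AA']


def event(condition):
    """Collect the outcomes of S for which the condition holds."""
    return {outcome for outcome in S if condition(outcome)}


def kind(ev):
    """Classify an event by the number of outcomes it contains."""
    if not ev:
        raise ValueError("an event must contain at least one outcome")
    return "simple" if len(ev) == 1 else "compound"


both_in_favor = event(lambda o: o.count("F") == 2)
at_most_one_against = event(lambda o: o.count("A") <= 1)
exactly_one_in_favor = event(lambda o: o.count("F") == 1)

for name, ev in [("both in favor", both_in_favor),
                 ("at most one against", at_most_one_against),
                 ("exactly one in favor", exactly_one_in_favor)]:
    print(name, sorted(ev), kind(ev))
```

Only the first event contains a single outcome, so it is the only simple event of the three.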

EXERCISES

■ CONCEPTS AND PROCEDURES

4.1 Define the following terms: experiment, outcome, sample space, simple event, and compound event.

4.2 List the simple events for each of the following statistical experiments in a sample space S.

a. One roll of a die b. Three tosses of a coin c. One toss of a coin and one roll of a die

4.3 A box contains three items that are labeled A, B, and C. Two items are selected at random (without
replacement) from this box. List all the possible outcomes for this experiment. Write the sample space S.

■ APPLICATIONS

4.4 Two students are randomly selected from a statistics class, and it is observed whether or not they
suffer from math anxiety. How many total outcomes are possible? Draw a tree diagram for this experi-
ment. Draw a Venn diagram.

4.5 In a group of adults, some have iPods, and others do not. If two adults are randomly selected from
this group, how many total outcomes are possible? Draw a tree diagram for this experiment.

4.6 A test contains two multiple-choice questions. If a student makes a random guess to answer each
question, how many outcomes are possible? Depict all these outcomes in a Venn diagram. Also draw a
tree diagram for this experiment. (Hint: Consider two outcomes for each question — either the answer is
correct or it is wrong.)

4.7 A box contains a certain number of computer parts, a few of which are defective. Two parts are se-
lected at random from this box and inspected to determine if they are good or defective. How many total
outcomes are possible? Draw a tree diagram for this experiment.

4.8 In a group of people, some are in favor of a tax increase on rich people to reduce the federal deficit
and others are against it. (Assume that there is no other outcome such as "no opinion" and "do not know.")
Three persons are selected at random from this group and their opinions in favor or against raising such
taxes are noted. How many total outcomes are possible? Write these outcomes in a sample space S. Draw
a tree diagram for this experiment.

4.9 Draw a tree diagram for three tosses of a coin. List all outcomes for this experiment in a sample
space S.

4.10 Refer to Exercise 4.4. List all the outcomes included in each of the following events. Indicate which
are simple and which are compound events.

a. Both students suffer from math anxiety.

b. Exactly one student suffers from math anxiety.

c. The first student does not suffer and the second suffers from math anxiety.

d. None of the students suffers from math anxiety.

4.11 Refer to Exercise 4.5. List all the outcomes included in each of the following events. Indicate which
are simple and which are compound events.

a. One person has an iPod and the other does not.

b. At least one person has an iPod.

c. Not more than one person has an iPod.

d. The first person has an iPod and the second does not.

4.12 Refer to Exercise 4.6. List all the outcomes included in each of the following events and mention
which are simple and which are compound events.

a. Both answers are correct.

b. At most one answer is wrong.

c. The first answer is correct and the second is wrong.

d. Exactly one answer is wrong.


4.13 Refer to Exercise 4.7. List all the outcomes included in each of the following events. Indicate which
are simple and which are compound events.

a. At least one part is good.

b. Exactly one part is defective.

c. The first part is good and the second is defective.

d. At most one part is good.

4.14 Refer to Exercise 4.8. List all the outcomes included in each of the following events and mention
which are simple and which are compound events.

a. At most one person is against a tax increase on rich people.

b. Exactly two persons are in favor of a tax increase on rich people.

c. At least one person is against a tax increase on rich people.

d. More than one person is against a tax increase on rich people.

4.2 Calculating Probability

Probability, which gives the likelihood of occurrence of an event, is denoted by P. The probability
that a simple event $E_i$ will occur is denoted by $P(E_i)$, and the probability that a compound
event A will occur is denoted by P(A).

Definition

Probability Probability is a numerical measure of the likelihood that a specific event will occur.

Two Properties of Probability

1. The probability of an event always lies in the range 0 to 1.

Whether it is a simple or a compound event, the probability of an event is never less than 0 or
greater than 1. Using mathematical notation, we can write this property as follows.

First Property of Probability

$$0 \le P(E_i) \le 1$$

$$0 \le P(A) \le 1$$

An event that cannot occur has zero probability; such an event is called an impossible event.
An event that is certain to occur has a probability equal to 1 and is called a sure event. That is,

For an impossible event M: P(M) = 0
For a sure event C: P(C) = 1

2. The sum of the probabilities of all simple events (or final outcomes) for an experiment,
denoted by $\sum P(E_i)$, is always 1.

Second Property of Probability For an experiment,

$$\sum P(E_i) = P(E_1) + P(E_2) + P(E_3) + \cdots = 1$$

From this property, for the experiment of one toss of a coin,

P(H) + P(T) = 1

For the experiment of two tosses of a coin,

P(HH) + P(HT) + P(TH) + P(TT) = 1

For one game of football by a professional team,

P(win) + P(loss) + P(tie) = 1


4.2.1 Three Conceptual Approaches to Probability

The three conceptual approaches to probability are (1) classical probability, (2) the relative fre-
quency concept of probability, and (3) the subjective probability concept. These three concepts
are explained next.

Classical Probability

Many times, various outcomes for an experiment may have the same probability of occurrence.
Such outcomes are called equally likely outcomes. The classical probability rule is applied to
compute the probabilities of events for an experiment for which all outcomes are equally likely.

Definition

Equally Likely Outcomes Two or more outcomes (or events) that have the same probability of
occurrence are said to be equally likely outcomes (or events).

According to the classical probability rule, the probability of a simple event is equal to 1
divided by the total number of outcomes for the experiment. This is obvious because the sum
of the probabilities of all final outcomes for an experiment is 1, and all the final outcomes are
equally likely. In contrast, the probability of a compound event A is equal to the number of out-
comes favorable to event A divided by the total number of outcomes for the experiment.

Classical Probability Rule to Find Probability

P(E_i) = 1 / (Total number of outcomes for the experiment)

P(A) = (Number of outcomes favorable to A) / (Total number of outcomes for the experiment)

Examples 4-7 through 4-9 illustrate how probabilities of events are calculated using the
classical probability rule.

Calculating the probability
of a simple event.

■ EXAMPLE 4-7

Find the probability of obtaining a head and the probability of obtaining a tail for one toss of
a coin.

Solution The two outcomes, head and tail, are equally likely outcomes. Therefore,

P(head) = 1 / (Total number of outcomes) = 1/2 = .50

Similarly,

P(tail) = 1/2 = .50

Calculating the probability
of a compound event.

■ EXAMPLE 4-8

Find the probability of obtaining an even number in one roll of a die.

¹If the final answer for the probability of an event does not terminate within three decimal places, usually it is rounded
to four decimal places.


Solution This experiment has a total of six outcomes: 1, 2, 3, 4, 5, and 6. All these out-
comes are equally likely. Let A be an event that an even number is observed on the die. Event
A includes three outcomes: 2, 4, and 6; that is,

A = {2, 4, 6}

If any one of these three numbers is obtained, event A is said to occur. Hence,

P(A) = (Number of outcomes included in A) / (Total number of outcomes) = 3/6 = .50

■ EXAMPLE 4-9

In a group of 500 women, 120 have played golf at least once. Suppose one of these 500 women
is randomly selected. What is the probability that she has played golf at least once?

Solution Because the selection is to be made randomly, each of the 500 women has the
same probability of being selected. Consequently this experiment has a total of 500 equally
likely outcomes. One hundred twenty of these 500 outcomes are included in the event that the
selected woman has played golf at least once. Hence,

P(selected woman has played golf at least once) = 120/500 = .24
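
The classical probability rule used in Examples 4-8 and 4-9 amounts to counting. The sketch below is an illustration under stated assumptions: `classical_probability` is a hypothetical helper name, and for Example 4-9 the 500 women are simply labeled 0 through 499 with the first 120 labels standing for the women who have played golf. The rule applies only when all outcomes are equally likely.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """Outcomes favorable to the event divided by total outcomes.

    Valid only when every outcome in sample_space is equally likely.
    """
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))


# Example 4-8: an even number in one roll of a die.
die = [1, 2, 3, 4, 5, 6]
even = {2, 4, 6}
p_even = classical_probability(even, die)

# Example 4-9: 120 of 500 women have played golf (labels are arbitrary).
women = range(500)
golfers = set(range(120))
p_golf = classical_probability(golfers, women)

print(p_even, float(p_even))
print(p_golf, float(p_golf))
```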

Relative Frequency Concept of Probability

Suppose we want to calculate the following probabilities:

1. The probability that the next car that comes out of an auto factory is a "lemon"

2. The probability that a randomly selected family owns a home

3. The probability that a randomly selected woman has never smoked

4. The probability that an 80-year-old person will live for at least 1 more year

5. The probability that the tossing of an unbalanced coin will result in a head

6. The probability that a randomly selected person owns a sport-utility vehicle (SUV)

These probabilities cannot be computed using the classical probability rule because the various out-
comes for the corresponding experiments are not equally likely. For example, the next car manu-
factured at an auto factory may or may not be a lemon. The two outcomes, "it is a lemon" and "it
is not a lemon," are not equally likely. If they were, then (approximately) half the cars manufac-
tured by this company would be lemons, and this might prove disastrous to the survival of the firm.

Although the various outcomes for each of these experiments are not equally likely, each of
these experiments can be performed again and again to generate data. In such cases, to calculate
probabilities, we either use past data or generate new data by performing the experiment a large
number of times. The relative frequency of an event is used as an approximation for the probabil-
ity of that event. This method of assigning a probability to an event is called the relative frequency
concept of probability. Because relative frequencies are determined by performing an experiment,
the probabilities calculated using relative frequencies may change almost each time an experiment
is repeated. For example, every time a new sample of 500 cars is selected from the production line
of an auto factory, the number of lemons in those 500 cars is expected to be different. However,
the variation in the percentage of lemons will be small if the sample size is large. Note that if we
are considering the population, the relative frequency will give an exact probability.

Using Relative Frequency as an Approximation of Probability If an experiment is repeated n times
and an event A is observed f times, then, according to the relative frequency concept of probability,

P(A) = f/n

Examples 4-10 and 4-11 illustrate how the probabilities of events are approximated using
the relative frequencies.


■ EXAMPLE 4-10

Ten of the 500 randomly selected cars manufactured at a certain auto factory are found to be
lemons. Assuming that the lemons are manufactured randomly, what is the probability that the
next car manufactured at this auto factory is a lemon?

Solution Let n denote the total number of cars in the sample and f the number of lemons
in n. Then,

n = 500 and f = 10

Using the relative frequency concept of probability, we obtain

P(next car is a lemon) = f/n = 10/500 = .02

This probability is actually the relative frequency of lemons in 500 cars. Table 4.2 lists the
frequency and relative frequency distributions for this example.

Table 4.2 Frequency and Relative Frequency
Distributions for the Sample of Cars

Car        f          Relative Frequency
Good       490        490/500 = .98
Lemon      10         10/500 = .02
           n = 500    Sum = 1.00

The column of relative frequencies in Table 4.2 is used as the column of approximate prob-
abilities. Thus, from the relative frequency column,

P(next car is a lemon) = .02

P(next car is a good car) = .98 ■
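
The relative frequency column of Table 4.2 can be reproduced from raw observations. The following sketch assumes the sample is available as a list of labels, one per car; the function name and the labels "good" and "lemon" are illustrative choices, not from the text.

```python
from collections import Counter


def relative_frequencies(observations):
    """Return f/n for each category observed in the sample."""
    counts = Counter(observations)
    n = len(observations)
    return {category: count / n for category, count in counts.items()}


# Example 4-10: 490 good cars and 10 lemons in a sample of 500.
sample = ["good"] * 490 + ["lemon"] * 10
approx_prob = relative_frequencies(sample)

for category, value in approx_prob.items():
    print(category, value)
```

These values are approximate probabilities because they come from a sample rather than from a census.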

Note that relative frequencies are not exact probabilities but are approximate proba-
bilities unless they are based on a census. However, if the experiment is repeated again and
again, this approximate probability of an outcome obtained from the relative frequency will
approach the actual probability of that outcome. This is called the Law of Large Numbers.

Definition

Law of Large Numbers If an experiment is repeated again and again, the probability of an event
obtained from the relative frequency approaches the actual or theoretical probability.
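
The Law of Large Numbers can be seen in a small simulation. This is only a sketch under assumptions not in the text: an unbalanced coin with a hypothetical P(head) = .30 is tossed by a pseudorandom generator with a fixed seed. The relative frequency need not get closer at every step, but for large n it tends to settle near the actual probability.

```python
import random


def simulated_relative_frequency(p_head, n_tosses, rng):
    """Toss a coin with P(head) = p_head n_tosses times; return f/n for heads."""
    heads = sum(rng.random() < p_head for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(0)  # fixed seed so that the run is repeatable
p_head = 0.30           # hypothetical probability for an unbalanced coin

results = {
    n: simulated_relative_frequency(p_head, n, rng)
    for n in (10, 100, 10_000, 100_000)
}

for n, freq in results.items():
    print(f"n = {n:>7}: relative frequency of heads = {freq:.4f}")
```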

■ EXAMPLE 4-11

Allison wants to determine the probability that a randomly selected family from New York
State owns a home. How can she determine this probability?

Solution There are two outcomes for a randomly selected family from New York State: "This
family owns a home" and "this family does not own a home." These two events are not equally
likely. (Note that these two outcomes will be equally likely if exactly half of the families in
New York State own homes and exactly half do not own homes.) Hence, the classical proba-
bility rule cannot be applied. However, we can repeat this experiment again and again. In other
words, we can select a sample of families from New York State and observe whether or not
each of them owns a home. Hence, we will use the relative frequency approach to probability.

Approximating probability by
relative frequency: sample data.

Approximating probability by
relative frequency.


Suppose Allison selects a random sample of 1000 families from New York State and ob-
serves that 730 of them own homes and 270 do not own homes. Then,

n = sample size = 1000

f = number of families who own homes = 730

Consequently,

P(a randomly selected family owns a home) = f/n = 730/1000 = .730

Again, note that .730 is just an approximation of the probability that a randomly selected
family from New York State owns a home. Every time Allison repeats this experiment she
may obtain a different probability for this event. However, because the sample size (n = 1000)
in this example is large, the variation is expected to be very small. ■

Subjective Probability

Many times we face experiments that neither have equally likely outcomes nor can be repeated
to generate data. In such cases, we cannot compute the probabilities of events using the classi-
cal probability rule or the relative frequency concept. For example, consider the following prob-
abilities of events:

1. The probability that Carol, who is taking a statistics course, will earn an A in the course

2. The probability that the Dow Jones Industrial Average will be higher at the end of the next
trading day

3. The probability that the New York Giants will win the Super Bowl next season

4. The probability that Joe will lose the lawsuit he has filed against his landlord

Neither the classical probability rule nor the relative frequency concept of probability can
be applied to calculate probabilities for these examples. All these examples belong to experi-
ments that have neither equally likely outcomes nor the potential of being repeated. For exam-
ple, Carol, who is taking statistics, will take the test (or tests) only once, and based on that she
will either earn an A or not. The two events "she will earn an A" and "she will not earn an A"
are not equally likely. The probability assigned to an event in such cases is called subjective
probability. It is based on the individual's judgment, experience, information, and belief. Carol
may assign a high probability to the event that she will earn an A in statistics, whereas her in-
structor may assign a low probability to the same event.

Definition

Subjective Probability Subjective probability is the probability assigned to an event based on
subjective judgment, experience, information, and belief.

Subjective probability is assigned arbitrarily. It is usually influenced by the biases, prefer-
ences, and experience of the person assigning the probability.

EXERCISES

CONCEPTS AND PROCEDURES

4.15 Briefly explain the two properties of probability.

4.16 Briefly describe an impossible event and a sure event. What is the probability of the occurrence of
each of these two events?

4.17 Briefly explain the three approaches to probability. Give one example of each approach.


4.18 Briefly explain for what kind of experiments we use the classical approach to calculate probabilities
of events and for what kind of experiments we use the relative frequency approach.

4.19 Which of the following values cannot be probabilities of events and why?
1/5 .97 -.55 1.56 5/3 0.0 -2/7 1.0

4.20 Which of the following values cannot be probabilities of events and why?
.46 2/3 -.09 1.42 .96 9/4 -1/4 .02

■ APPLICATIONS

4.21 Suppose a randomly selected passenger is about to go through the metal detector at JFK Airport in
New York City. Consider the following two outcomes: The passenger sets off the metal detector, and the
passenger does not set off the metal detector. Are these two outcomes equally likely? Explain why or why
not. If you are to find the probability of these two outcomes, would you use the classical approach or the
relative frequency approach? Explain why.

4.22 Thirty-two persons have applied for a security guard position with a company. Of them, 7 have
previous experience in this area and 25 do not. Suppose one applicant is selected at random. Consider
the following two events: This applicant has previous experience, and this applicant does not have pre-
vious experience. If you are to find the probabilities of these two events, would you use the classical
approach or the relative frequency approach? Explain why.

4.23 The president of a company has a hunch that there is a .80 probability that the company will be suc-
cessful in marketing a new brand of ice cream. Is this a case of classical, relative frequency, or subjective
probability? Explain why.

4.24 The coach of a college football team thinks that there is a .75 probability that the team will win the
national championship this year. Is this a case of classical, relative frequency, or subjective probability?
Explain why.

4.25 A hat contains 40 marbles. Of them, 18 are red and 22 are green. If one marble is randomly selected
out of this hat, what is the probability that this marble is

a. red? b. green?

4.26 A die is rolled once. What is the probability that

a. a number less than 5 is obtained?

b. a number 3 to 6 is obtained?

4.27 A random sample of 2000 adults showed that 1320 of them have shopped at least once on the Inter-
net. What is the (approximate) probability that a randomly selected adult has shopped on the Internet?

4.28 In a statistics class of 42 students, 28 have volunteered for community service in the past. Find the prob-
ability that a randomly selected student from this class has volunteered for community service in the past.

4.29 In a group of 50 car owners, 8 own hybrid cars. If one car owner is selected at random from this
group, what is the probability that this car owner owns a hybrid car?

4.30 Out of the 3000 families who live in a given apartment complex in New York City, 600 paid no in-
come tax last year. What is the probability that a randomly selected family from these 3000 families did
pay income tax last year?

4.31 A multiple-choice question on a test has five answers. If Dianne chooses one answer based on "pure
guess," what is the probability that her answer is

a. correct? b. wrong?

Do these two probabilities add up to 1.0? If yes, why?

4.32 There are 1265 eligible voters in a town, and 972 of them are registered to vote. If one eligible voter
is selected at random, what is the probability that this voter is

a. registered? b. not registered?

Do these two probabilities add up to 1.0? If yes, why?

4.33 A company that plans to hire one new employee has prepared a final list of six candidates, all of whom
are equally qualified. Four of these six candidates are women. If the company decides to select at random
one person out of these six candidates, what is the probability that this person will be a woman? What is
the probability that this person will be a man? Do these two probabilities add up to 1.0? If yes, why?

4.34 A sample of 500 large companies showed that 120 of them offer free psychiatric help to their
employees who suffer from psychological problems. If one company is selected at random from this sample,
what is the probability that this company offers free psychiatric help to its employees who suffer from
psychological problems? What is the probability that this company does not offer free psychiatric help to its
employees who suffer from psychological problems? Do these two probabilities add up to 1.0? If yes, why?

4.35 A sample of 400 large companies showed that 130 of them offer free health fitness centers to their
employees on the company premises. If one company is selected at random from this sample, what is the
probability that this company offers a free health fitness center to its employees on the company prem-
ises? What is the probability that this company does not offer a free health fitness center to its employees
on the company premises? Do these two probabilities add up to 1.0? If yes, why?

4.36 In a large city, 15,000 workers lost their jobs last year. Of them, 7400 lost their jobs because their
companies closed down or moved, 4600 lost their jobs due to insufficient work, and the remainder lost
their jobs because their positions were abolished. If one of these 15,000 workers is selected at random,
find the probability that this worker lost his or her job

a. because the company closed down or moved

b. due to insufficient work

c. because the position was abolished

Do these probabilities add up to 1.0? If so, why?

4.37 A sample of 820 adults showed that 80 of them had no credit cards, 116 had one card each, 94 had
two cards each, 77 had three cards each, 43 had four cards each, and 410 had five or more cards each.
Write the frequency distribution table for the number of credit cards an adult possesses. Calculate the rel-
ative frequencies for all categories. Suppose one adult is randomly selected from these 820 adults. Find
the probability that this adult has

a. three credit cards b. five or more credit cards

4.38 In a sample of 500 families, 70 have a yearly income of less than $40,000, 220 have a yearly in-
come of $40,000 to $80,000, and the remaining families have a yearly income of more than $80,000. Write
the frequency distribution table for this problem. Calculate the relative frequencies for all classes. Sup-
pose one family is randomly selected from these 500 families. Find the probability that this family has a
yearly income of

a. less than $40,000 b. more than $80,000

4.39 Suppose you want to find the (approximate) probability that a randomly selected family from Los
Angeles earns more than $175,000 a year. How would you find this probability? What procedure would
you use? Explain briefly.

4.40 Suppose you have a loaded die and you want to find the (approximate) probabilities of different out-
comes for this die. How would you find these probabilities? What procedure would you use? Explain briefly.

4.3 Counting Rule

The experiments dealt with so far in this chapter have had only a few outcomes, which were easy
to list. However, for experiments with a large number of outcomes, it may not be easy to list all
outcomes. In such cases, we may use the counting rule to find the total number of outcomes.

Counting Rule to Find Total Outcomes If an experiment consists of three steps, and if the first step
can result in m outcomes, the second step in n outcomes, and the third step in k outcomes, then

Total outcomes for the experiment = m • n • k

The counting rule can easily be extended to apply to an experiment that has fewer or more
than three steps.

■ EXAMPLE 4-12

Suppose we toss a coin three times. This experiment has three steps: the first toss, the second
toss, and the third toss. Each step has two outcomes: a head and a tail. Thus,

Total outcomes for three tosses of a coin = 2 × 2 × 2 = 8
The eight outcomes for this experiment are HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT. ■


■ EXAMPLE 4-13

A prospective car buyer can choose between a fixed and a variable interest rate and can also
choose a payment period of 36 months, 48 months, or 60 months. How many total outcomes
are possible?

Solution This experiment is made up of two steps: choosing an interest rate and selecting
a loan payment period. There are two outcomes (a fixed or a variable interest rate) for the first
step and three outcomes (a payment period of 36 months, 48 months, or 60 months) for the
second step. Hence,

Total outcomes = 2 × 3 = 6 ■

■ EXAMPLE 4-14

A National Football League team will play 16 games during a regular season. Each game can
result in one of three outcomes: a win, a loss, or a tie. The total possible outcomes for 16
games are calculated as follows:

Total outcomes = 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3
= 3^16 = 43,046,721

One of the 43,046,721 possible outcomes is all 16 wins. ■
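
The counting rule is a product of the number of outcomes at each step, so Examples 4-12 through 4-14 can be checked in a few lines. The sketch below uses a hypothetical helper `total_outcomes`; for the small coin experiment it also lists the outcomes explicitly, which is practical only when the total is small.

```python
from itertools import product
from math import prod


def total_outcomes(outcomes_per_step):
    """Counting rule: multiply the number of outcomes of every step."""
    return prod(outcomes_per_step)


coin_tosses = total_outcomes([2, 2, 2])    # Example 4-12: three tosses
loan_choices = total_outcomes([2, 3])      # Example 4-13: rate type, then period
season = total_outcomes([3] * 16)          # Example 4-14: 16 games, 3 results each

# Listing the outcomes is feasible for three tosses, not for 16 games.
listed = ["".join(tosses) for tosses in product("HT", repeat=3)]

print(coin_tosses, loan_choices, season)
print(listed)
```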

4.4 Marginal and Conditional Probabilities

Suppose all 100 employees of a company were asked whether they are in favor of or against
paying high salaries to CEOs of U.S. companies. Table 4.3 gives a two-way classification of
the responses of these 100 employees.

Table 4.3 Two-Way Classification of Employee
Responses

           In Favor    Against
Male          15          45
Female         4          36

Table 4.3 shows the distribution of 100 employees based on two variables or characteris-
tics: gender (male or female) and opinion (in favor or against). Such a table is called a contin-
gency table. In Table 4.3, each box that contains a number is called a cell. Notice that there are
four cells. Each cell gives the frequency for two characteristics. For example, 15 employees in
this group possess two characteristics: "male" and "in favor of paying high salaries to CEOs."
We can interpret the numbers in other cells the same way.

By adding the row totals and the column totals to Table 4.3, we write Table 4.4.

Table 4.4 Two-Way Classification of Employee
Responses with Totals

           In Favor    Against    Total
Male          15          45        60
Female         4          36        40
Total         19          81       100

Suppose one employee is selected at random from these 100 employees. This employee
may be classified either on the basis of gender alone or on the basis of opinion. If only one
characteristic is considered at a time, the employee selected can be a male, a female, in favor,
or against. The probability of each of these four characteristics or events is called marginal
probability or simple probability. These probabilities are called marginal probabilities because
they are calculated by dividing the corresponding row margins (totals for the rows) or column
margins (totals for the columns) by the grand total.

Definition

Marginal Probability Marginal probability is the probability of a single event without consid-
eration of any other event. Marginal probability is also called simple probability.

For Table 4.4, the four marginal probabilities are calculated as follows:

P(male) = (Number of males) / (Total number of employees) = 60/100 = .60

As we can observe, the probability that a male will be selected is obtained by dividing the to-
tal of the row labeled "Male" (60) by the grand total (100). Similarly,

P(female) = 40/100 = .40
P(in favor) = 19/100 = .19
P(against) = 81/100 = .81

These four marginal probabilities are shown along the right side and along the bottom of
Table 4.5.

Table 4.5 Listing the Marginal Probabilities

              In Favor (A)     Against (B)      Total
Male (M)          15               45             60     P(M) = 60/100 = .60
Female (F)         4               36             40     P(F) = 40/100 = .40
Total             19               81            100
              P(A) = 19/100    P(B) = 81/100
                   = .19            = .81
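
The marginal probabilities are row or column totals divided by the grand total, which is easy to compute from the four cell counts. In this sketch the dictionary layout, the key labels, and the function name `marginal` are illustrative assumptions; the counts are those of the 100 employees.

```python
from fractions import Fraction

# Cell counts keyed by (gender, opinion).
table = {
    ("male", "in favor"): 15,
    ("male", "against"): 45,
    ("female", "in favor"): 4,
    ("female", "against"): 36,
}


def marginal(table, position, value):
    """Row or column total divided by the grand total.

    position 0 selects on gender, position 1 selects on opinion.
    """
    grand_total = sum(table.values())
    margin = sum(count for key, count in table.items() if key[position] == value)
    return Fraction(margin, grand_total)


p_male = marginal(table, 0, "male")
p_female = marginal(table, 0, "female")
p_in_favor = marginal(table, 1, "in favor")
p_against = marginal(table, 1, "against")

print(float(p_male), float(p_female), float(p_in_favor), float(p_against))
```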

Now suppose that one employee is selected at random from these 100 employees. Further-
more, assume it is known that this (selected) employee is a male. In other words, the event that
the employee selected is a male has already occurred. What is the probability that the employee
selected is in favor of paying high salaries to CEOs? This probability is written as follows:

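P(in favor | male)

Here the vertical bar is read as "given." The event to the left of the bar ("in favor") is the
event whose probability is to be determined, and the event to the right of the bar ("male") has
already occurred.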

This probability, P(in favor | male), is called the conditional probability of "in favor" given
that the event "male" has already happened. It is read as "the probability that the employee se-
lected is in favor given that this employee is a male."

Definition

Conditional Probability Conditional probability is the probability that an event will occur given
that another event has already occurred. If A and B are two events, then the conditional proba-
bility of A given B is written as

P(A | B)

and read as "the probability of A given that B has already occurred."


■ EXAMPLE 4-15

Calculating the conditional probability: two-way table.

Compute the conditional probability P(in favor | male) for the data on 100 employees given
in Table 4.4.

Solution The probability P(in favor | male) is the conditional probability that a randomly
selected employee is in favor given that this employee is a male. It is known that the event
"male" has already occurred. Based on the information that the employee selected is a male,
we can infer that the employee selected must be one of the 60 males and, hence, must belong
to the first row of Table 4.4. Therefore, we are concerned only with the first row of that table.

           In Favor    Against    Total
Male          15          45        60

Here 15 is the number of males who are in favor, and 60 is the total number of males.

The required conditional probability is calculated as follows:

P(in favor | male) = (Number of males who are in favor) / (Total number of males) = 15/60 = .25

As we can observe from this computation of conditional probability, the total number of
males (the event that has already occurred) is written in the denominator and the number of
males who are in favor (the event whose probability we are to find) is written in the numera-
tor. Note that we are considering the row of the event that has already occurred. The tree di-
agram in Figure 4.6 illustrates this example.

Figure 4.6 Tree diagram.

Calculating the conditional
probability: two-way table.

■ EXAMPLE 4-16

For the data of Table 4.4, calculate the conditional probability that a randomly selected em-
ployee is a female given that this employee is in favor of paying high salaries to CEOs.

Solution We are to compute the probability P(female | in favor). Because it is known that
the employee selected is in favor of paying high salaries to CEOs, this employee must belong
to the first column (the column labeled "in favor") and must be one of the 19 employees who
are in favor.

In Favor
    4   (females who are in favor)
   19   (total number of employees who are in favor)

[Chart: USA TODAY Snapshots, "Rolling stops." Men are more likely than women to say they
don't come to a complete stop at stop signs. The chart shows how often men and women only
slow down. Source: Consumer Reports National Research Center survey of 1,000 adults.
By Anne R. Carey and Alejandro Gonzalez, USA TODAY.]

The accompanying chart shows the percentages of men and women who only slow down at stop signs
instead of coming to a complete stop. As the percentages in the chart show, when asked whether they only
slow down (instead of coming to a complete stop), 17% of men said often (that is, they often only slow
down and proceed instead of coming to a complete stop), 41% said occasionally (that is, they occasion-
ally only slow down but mostly stop at stop signs), and 42% said never (that is, they never slow down and
proceed but always make a complete stop at stop signs). Among women, the corresponding percentages
are 13%, 30%, and 56%, respectively. These percentages can be written as conditional probabilities as fol-
lows. Suppose one driver is selected at random. Then, given that this driver is a male, the probability is .42
that he makes a complete stop at stop signs. On the other hand, if this selected driver is a female, this
probability is .56. These probabilities can be written as:

P(selected driver makes a complete stop at stop signs | male) = .42

P(selected driver makes a complete stop at stop signs | female) = .56

Source: USA TODAY, January 30, 2009. Reproduced with permission.

Hence, the required probability is

P(female | in favor) = (Number of females who are in favor) / (Total number of employees who are in favor) = 4/19 = .2105

The tree diagram in Figure 4.7 illustrates this example.

Figure 4.7 Tree diagram.
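
The conditional probabilities of Examples 4-15 and 4-16 restrict attention to one row or one column of the two-way table. The following sketch makes that restriction explicit. The dictionary layout and the function name `conditional` are illustrative assumptions; the counts are those of the 100 employees in Table 4.4.

```python
from fractions import Fraction

# Cell counts keyed by (gender, opinion).
table = {
    ("male", "in favor"): 15,
    ("male", "against"): 45,
    ("female", "in favor"): 4,
    ("female", "against"): 36,
}


def conditional(table, event, given):
    """P(event | given), where each argument is a (position, value) pair.

    position 0 refers to gender and position 1 to opinion.
    """
    event_pos, event_value = event
    given_pos, given_value = given
    # Denominator: total for the event that has already occurred.
    given_total = sum(
        count for key, count in table.items() if key[given_pos] == given_value
    )
    # Numerator: cells inside that row or column that also satisfy the event.
    both = sum(
        count
        for key, count in table.items()
        if key[given_pos] == given_value and key[event_pos] == event_value
    )
    return Fraction(both, given_total)


p_favor_given_male = conditional(table, (1, "in favor"), (0, "male"))
p_female_given_favor = conditional(table, (0, "female"), (1, "in favor"))

print(p_favor_given_male, float(p_favor_given_male))
print(p_female_given_favor, round(float(p_female_given_favor), 4))
```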

4.5 Mutually Exclusive Events

Events that cannot occur together are called mutually exclusive events. Such events do not
have any common outcomes. If two or more events are mutually exclusive, then at most one of
them will occur every time we repeat the experiment. Thus the occurrence of one event excludes
the occurrence of the other event or events.

Definition

Mutually Exclusive Events Events that cannot occur together are said to be mutually exclusive
events.

For any experiment, the final outcomes are always mutually exclusive because one and only
one of these outcomes is expected to occur in one repetition of the experiment. For example,
consider tossing a coin twice. This experiment has four outcomes: HH, HT, TH, and TT. These
outcomes are mutually exclusive because one and only one of them will occur when we toss
this coin twice.

■ EXAMPLE 4-17

Consider the following events for one roll of a die:

A = an even number is observed = {2, 4, 6}
B = an odd number is observed = {1, 3, 5}
C = a number less than 5 is observed = {1, 2, 3, 4}

Are events A and B mutually exclusive? Are events A and C mutually exclusive?

Solution Figures 4.8 and 4.9 show the diagrams of events A and B and events A and C, re-
spectively.

Illustrating mutually
exclusive and mutually
nonexclusive events.

Figure 4.8 Mutually exclusive events A and B.

Figure 4.9 Mutually nonexclusive events A and C.

As we can observe from the definitions of events A and B and from Figure 4.8, events A
and B have no common element. For one roll of a die, only one of the two events A and B
can happen. Hence, these are two mutually exclusive events. We can observe from the defini-
tions of events A and C and from Figure 4.9 that events A and C have two common outcomes:
2-spot and 4-spot. Thus, if we roll a die and obtain either a 2-spot or a 4-spot, then A and C
happen at the same time. Hence, events A and C are not mutually exclusive. ■
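
When events are written as sets of outcomes, as in Example 4-17, mutual exclusivity is the same as having no outcome in common. The sketch below checks this with Python sets; the function name is a hypothetical choice.

```python
def mutually_exclusive(event_1, event_2):
    """Two events are mutually exclusive when they share no outcome."""
    return event_1.isdisjoint(event_2)


A = {2, 4, 6}     # an even number is observed
B = {1, 3, 5}     # an odd number is observed
C = {1, 2, 3, 4}  # a number less than 5 is observed

print(mutually_exclusive(A, B))
print(mutually_exclusive(A, C))
print(sorted(A & C))  # the outcomes common to A and C
```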

■ EXAMPLE 4-18

Illustrating mutually exclusive events.

Consider the following two events for a randomly selected adult:

Y = this adult has shopped on the Internet at least once
N = this adult has never shopped on the Internet

Are events Y and N mutually exclusive?


Solution Note that event Y consists of all adults who have shopped on the Internet at least
once, and event N includes all adults who have never shopped on the Internet. These two events
are illustrated in the Venn diagram in Figure 4.10.

As we can observe from the definitions of events Y and N and from Figure 4.10, events Y
and N have no common outcome. They represent two distinct sets of adults: the ones who have
shopped on the Internet at least once and the ones who have never shopped on the Internet.
Hence, these two events are mutually exclusive. ■

4.6 Independent Versus Dependent Events

In the case of two independent events, the occurrence of one event does not change the prob-
ability of the occurrence of the other event.

Definition

Independent Events Two events are said to be independent if the occurrence of one does not
affect the probability of the occurrence of the other. In other words, A and B are independent
events if

either P(A | B) = P(A) or P(B | A) = P(B)

It can be shown that if one of these two conditions is true, then the second will also be true,
and if one is not true, then the second will also not be true.

If the occurrence of one event affects the probability of the occurrence of the other event,
then the two events are said to be dependent events. In probability notation, the two events are
dependent if either P(A | B) ≠ P(A) or P(B | A) ≠ P(B).

■ EXAMPLE 4-19

Refer to the information on 100 employees given in Table 4.4 in Section 4.4. Are events
"female (F)" and "in favor (A)" independent?

Solution Events F and A will be independent if

P(F) = P(F | A)

Otherwise they will be dependent.

Using the information given in Table 4.4, we compute the following two probabilities:

P(F) = 40/100 = .40 and P(F | A) = 4/19 = .2105

Because these two probabilities are not equal, the two events are dependent. Here, depend-
ence of events means that the respective percentages of males who are in favor of and against
paying high salaries to CEOs are different from the respective percentages of females who are
in favor and against.

In this example, the dependence of A and F can also be proved by showing that the prob-
abilities P(A) and P(A | F) are not equal. ■

Illustrating two dependent
events: two-way table.


■ EXAMPLE 4-20

A box contains a total of 100 CDs that were manufactured on two machines. Of them, 60 were
manufactured on Machine I. Of the total CDs, 15 are defective. Of the 60 CDs that were man-
ufactured on Machine I, 9 are defective. Let D be the event that a randomly selected CD is
defective, and let A be the event that a randomly selected CD was manufactured on Machine
I. Are events D and A independent?

independent events.

Solution From the given information,

P(D) = 15/100 = .15 and P(D | A) = 9/60 = .15

Hence,

P(D) = P(D | A)

Consequently, the two events, D and A, are independent.

Independence, in this example, means that the probability of any CD being defective is the
same, .15, irrespective of the machine on which it is manufactured. In other words, the two
machines are producing the same percentage of defective CDs. For example, 9 of the 60 CDs
manufactured on Machine I are defective, and 6 of the 40 CDs manufactured on Machine II
are defective. Thus, for each of the two machines, 15% of the CDs produced are defective.

Using the given information, we can prepare Table 4.6. The numbers in the shaded cells are
given to us. The remaining numbers are calculated by doing some arithmetic manipulations.

Table 4.6 Two-Way Classification Table

                  Defective (D)   Good (G)   Total
Machine I (A)          9*            51       60*
Machine II (B)         6             34       40
Total                 15*            85      100*

* Given value (shaded cell).

Using this table, we can find the following probabilities:

P(D) = 15/100 = .15
P(D | A) = 9/60 = .15

Because these two probabilities are the same, the two events are independent. ■
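
The comparison of P(D) with P(D | A) in Example 4-20 can be sketched in a few lines of Python. This is only an illustrative check, not part of the text's method; the function name `is_independent` and its argument names are assumptions introduced here, and exact fractions are used so that the equality test is not affected by rounding.

```python
from fractions import Fraction

def is_independent(n_total, n_a, n_d, n_d_and_a):
    """Return True when P(D) equals P(D | A), computed from counts.

    n_total   : total number of items
    n_a       : items belonging to event A
    n_d       : items belonging to event D
    n_d_and_a : items belonging to both D and A
    """
    p_d = Fraction(n_d, n_total)            # marginal probability P(D)
    p_d_given_a = Fraction(n_d_and_a, n_a)  # conditional probability P(D | A)
    return p_d == p_d_given_a

# Example 4-20: 100 CDs, 60 from Machine I, 15 defective, 9 defective from Machine I.
example_4_20 = is_independent(100, 60, 15, 9)
print(example_4_20)  # True

# Exercise 4.60 uses 20 defective CDs overall and 10 defective from Machine I.
exercise_4_60 = is_independent(100, 60, 20, 10)
print(exercise_4_60)  # False
```

With the counts of Example 4-20 both probabilities equal 3/20, so the check reports independence; with the counts of Exercise 4.60 they are 1/5 and 1/6, so it does not.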

Two Important Observations

We can make the following two important observations about mutually exclusive, independent,
and dependent events.

1. Two events cannot be both mutually exclusive and independent.²

a. Mutually exclusive events are always dependent.

b. Independent events are never mutually exclusive.

2. Dependent events may or may not be mutually exclusive.

4.7 Complementary Events

Two mutually exclusive events that taken together include all the outcomes for an experiment
are called complementary events. Note that two complementary events are always mutually
exclusive.

² The exception to this rule occurs when at least one of the two events has a zero probability.


Definition

Complementary Events The complement of event A, denoted by Ā and read as "A bar" or "A
complement," is the event that includes all the outcomes for an experiment that are not in A.

Events A and Ā are complements of each other. The Venn diagram in Figure 4.11 shows
the complementary events A and Ā.

Figure 4.11 Venn diagram of two complementary events.

Because two complementary events, taken together, include all the outcomes for an exper-
iment and because the sum of the probabilities of all outcomes is 1, it is obvious that

P(A) + P(Ā) = 1

From this equation, we can deduce that

P(A) = 1 - P(Ā) and P(Ā) = 1 - P(A)

Thus, if we know the probability of an event, we can find the probability of its complementary
event by subtracting the given probability from 1.

■ EXAMPLE 4-21

In a group of 2000 taxpayers, 400 have been audited by the IRS at least once. If one taxpayer
is randomly selected from this group, what are the two complementary events for this exper-
iment, and what are their probabilities?

Calculating probabilities of
complementary events.

Solution The two complementary events for this experiment are

A = the selected taxpayer has been audited by the IRS at least once

Ā = the selected taxpayer has never been audited by the IRS

Note that here event A includes the 400 taxpayers who have been audited by the IRS at least
once, and Ā includes the 1600 taxpayers who have never been audited by the IRS. Hence, the
probabilities of events A and Ā are

P(A) = 400/2000 = .20 and P(Ā) = 1600/2000 = .80

As we can observe, the sum of these two probabilities is one. Figure 4.12 shows a Venn diagram
for this example.

Figure 4.12 Venn diagram.


Calculating probabilities of
complementary events.

■ EXAMPLE 4-22

In a group of 5000 adults, 3500 are in favor of stricter gun control laws, 1200 are against such
laws, and 300 have no opinion. One adult is randomly selected from this group. Let A be the
event that this adult is in favor of stricter gun control laws. What is the complementary event
of A? What are the probabilities of the two events?

Solution The two complementary events for this experiment are

A = the selected adult is in favor of stricter gun control laws

Ā = the selected adult is either against such laws or has no opinion

Note that here event Ā includes 1500 adults who are either against stricter gun control laws
or have no opinion. Also notice that events A and Ā are complements of each other. Because
3500 adults in the group favor stricter gun control laws and 1500 either are against stricter
gun control laws or have no opinion, the probabilities of events A and Ā are

P(A) = 3500/5000 = .70 and P(Ā) = 1500/5000 = .30

As we can observe, the sum of these two probabilities is 1. Also, once we find P(A), we can
find P(Ā) as

P(Ā) = 1 - P(A) = 1 - .70 = .30

Figure 4.13 shows a Venn diagram for this example.

Figure 4.13 Venn diagram.
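
The complementary event rule used in Examples 4-21 and 4-22 can be sketched as a one-line function. The helper name `complement` is an assumption made for this illustration; exact fractions are used so the results match the hand calculations.

```python
from fractions import Fraction

def complement(p):
    """Return the probability of the complementary event, 1 - P(A)."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    return 1 - p

# Example 4-21: 400 of 2000 taxpayers have been audited at least once.
p_audited = Fraction(400, 2000)
p_never_audited = complement(p_audited)
print(p_never_audited, float(p_never_audited))  # 4/5 0.8

# Example 4-22: 3500 of 5000 adults favor stricter gun control laws.
p_favor = Fraction(3500, 5000)
p_not_favor = complement(p_favor)
print(p_not_favor, float(p_not_favor))  # 3/10 0.3
```

In both cases the two probabilities add to 1, as the rule requires.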

EXERCISES

CONCEPTS AND PROCEDURES

4.41 Briefly explain the difference between the marginal and conditional probabilities of events. Give one
example of each.

4.42 What is meant by two mutually exclusive events? Give one example of two mutually exclusive events
and another example of two mutually nonexclusive events.

4.43 Briefly explain the meaning of independent and dependent events. Suppose A and B are two events.
What formula can you use to prove whether A and B are independent or dependent?

4.44 What is the complement of an event? What is the sum of the probabilities of two complementary
events?

4.45 How many different outcomes are possible for four rolls of a die?

4.46 How many different outcomes are possible for 10 tosses of a coin?

4.47 A statistical experiment has eight equally likely outcomes that are denoted by 1, 2, 3, 4, 5, 6, 7, and
8. Let event A = {2, 5, 7} and event B = {2, 4, 8}.

a. Are events A and B mutually exclusive events?

b. Are events A and B independent events?

c. What are the complements of events A and B, respectively, and their probabilities?

4.48 A statistical experiment has 10 equally likely outcomes that are denoted by 1, 2, 3, 4, 5, 6, 7, 8, 9,
and 10. Let event A = {3, 4, 6, 9} and event B = {1, 2, 5}.

a. Are events A and B mutually exclusive events?

b. Are events A and B independent events?

c. What are the complements of events A and B, respectively, and their probabilities?


■ APPLICATIONS

4.49 A small ice cream shop has 10 flavors of ice cream and 5 kinds of toppings for its sundaes. How
many different selections of one flavor of ice cream and one kind of topping are possible?

4.50 A man just bought 4 suits, 8 shirts, and 12 ties. All of these suits, shirts, and ties coordinate with
each other. If he is to randomly select one suit, one shirt, and one tie to wear on a certain day, how many
different outcomes (selections) are possible?

4.51 A restaurant menu has four kinds of soups, eight kinds of main courses, five kinds of desserts, and
six kinds of drinks. If a customer randomly selects one item from each of these four categories, how many
different outcomes are possible?

4.52 A student is to select three classes for next semester. If this student decides to randomly select one
course from each of eight economics classes, six mathematics classes, and five computer classes, how
many different outcomes are possible?

4.53 Two thousand randomly selected adults were asked whether or not they have ever shopped on the
Internet. The following table gives a two-way classification of the responses.

            Have Shopped   Have Never Shopped
Male             500               700
Female           300               500

a. If one adult is selected at random from these 2000 adults, find the probability that this adult

i. has never shopped on the Internet

ii. is a male

iii. has shopped on the Internet given that this adult is a female
iv. is a male given that this adult has never shopped on the Internet

b. Are the events "male" and "female" mutually exclusive? What about the events "have shopped"
and "male?" Why or why not?

c. Are the events "female" and "have shopped" independent? Why or why not?

4.54 According to a March 2009 Gallup Poll (http://www.gallup.com/poll/117025/Support-Nuclear-
Energy-Inches-New-High.aspx), 71% of Republicans/Republican leaners and 52% of Democrats/Democrat
leaners favor the use of nuclear power. The survey consisted of 1012 American adults, approximately half
of whom were Republicans or Republican leaners. Suppose the following table gives the distribution of
responses of these 1012 adults.

                                  Favor   Do not favor
Republicans/Republican leaners     381        128
Democrats/Democrat leaners         258        245

a. If one person is selected at random from this sample of 1012 U.S. adults, find the probability that
this person

i. does not favor the use of nuclear power

ii. is a Republican/Republican leaner

iii. favors the use of nuclear power given that the person is a Republican/Republican leaner

iv. is a Republican/Republican leaner given that the person does not favor the use of nuclear
power

b. Are the events "favors" and "does not favor" mutually exclusive? What about the events "does not
favor" and "Republican/Republican leaner"?

c. Are the events "does not favor" and "Republican/Republican leaner" independent? Why or why not?

4.55 Two thousand randomly selected adults were asked if they are in favor of or against cloning. The
following table gives the responses.

          In Favor   Against   No Opinion
Male         395       405        100
Female       300       680        120


a. If one person is selected at random from these 2000 adults, find the probability that this person is

i. in favor of cloning

ii. against cloning

iii. in favor of cloning given the person is a female

iv. a male given the person has no opinion

b. Are the events "male" and "in favor" mutually exclusive? What about the events "in favor" and
"against?" Why or why not?

c. Are the events "female" and "no opinion" independent? Why or why not?

4.56 Five hundred employees were selected from a city's large private companies, and they were asked
whether or not they have any retirement benefits provided by their companies. Based on this information,
the following two-way classification table was prepared.

          Have Retirement Benefits
            Yes      No
Men         225      75
Women       150      50

a. If one employee is selected at random from these 500 employees, find the probability that this
employee

i. is a woman

ii. has retirement benefits

iii. has retirement benefits given the employee is a man

iv. is a woman given that she does not have retirement benefits

b. Are the events "man" and "yes" mutually exclusive? What about the events "yes" and "no?" Why
or why not?

c. Are the events "woman" and "yes" independent? Why or why not?

4.57 A consumer agency randomly selected 1700 flights for two major airlines, A and B. The following
table gives the two-way classification of these flights based on airline and arrival time. Note that "less than
30 minutes late" includes flights that arrived early or on time.

Less Than 30

30 Minutes to

More Than

Minutes Late

1 Hour Late

Airline A

429

390

Airline B

393

316

a. If one flight is selected at random from these 1700 flights, find the probability that this flight is

i. more than 1 hour late

ii. less than 30 minutes late

iii. a flight on airline A given that it is 30 minutes to 1 hour late

iv. more than 1 hour late given that it is a flight on airline B

b. Are the events "airline A" and "more than 1 hour late" mutually exclusive? What about the events
"less than 30 minutes late" and "more than 1 hour late?" Why or why not?

c. Are the events "airline B" and "30 minutes to 1 hour late" independent? Why or why not?

4.58 Two thousand randomly selected adults were asked if they think they are financially better off than
their parents. The following table gives the two-way classification of the responses based on the educa-
tion levels of the persons included in the survey and whether they are financially better off, the same as,
or worse off than their parents.

Less Than

High

More Than

High School

School

High School

Better off

140

450

420

Same as

250

110

Worse off

200

300


a. If one adult is selected at random from these 2000 adults, find the probability that this adult is

i. financially better off than his/her parents

ii. financially better off than his/her parents given he/she has less than high school education

iii. financially worse off than his/her parents given he/she has high school education

iv. financially the same as his/her parents given he/she has more than high school education

b. Are the events "better off" and "high school" mutually exclusive? What about the events "less
than high school" and "more than high school?" Why or why not?

c. Are the events "worse off" and "more than high school" independent? Why or why not?

4.59 There are a total of 160 practicing physicians in a city. Of them, 75 are female and 25 are pediatri-
cians. Of the 75 females, 20 are pediatricians. Are the events "female" and "pediatrician" independent?
Are they mutually exclusive? Explain why or why not.

4.60 Of a total of 100 CDs manufactured on two machines, 20 are defective. Sixty of the total CDs were
manufactured on Machine I, and 10 of these 60 are defective. Are the events "machine type" and "defec-
tive CDs" independent? (Note: Compare this exercise with Example 4-20.)

4.61 A company hired 30 new college graduates last week. Of these, 16 are female and 11 are business
majors. Of the 16 females, 9 are business majors. Are the events "female" and "business major" independ-
ent? Are they mutually exclusive? Explain why or why not.

4.62 Define the following two events for two tosses of a coin:

A = at least one head is obtained
B = both tails are obtained

a. Are A and B mutually exclusive events? Are they independent? Explain why or why not.

b. Are A and B complementary events? If yes, first calculate the probability of B and then calculate
the probability of A using the complementary event rule.

4.63 Let A be the event that a number less than 3 is obtained if we roll a die once. What is the probabil-
ity of A? What is the complementary event of A, and what is its probability?

4.64 According to a 2007 America's Families and Living Arrangements Census Bureau survey, 52.1 mil-
lion children lived with both of their parents in the same household, whereas 21.6 million lived with at
most one parent in the household. Assume that all U.S. children are included in this survey and that this
information is true for the current population. If one child is selected at random, what are the two com-
plementary events and their probabilities?

4.65 The probability that a randomly selected college student attended at least one major league base-
ball game last year is .12. What is the complementary event? What is the probability of this complementary
event?

4.8 Intersection of Events and the Multiplication Rule

This section discusses the intersection of two events and the application of the multiplication
rule to compute the probability of the intersection of events.

4.8.1 Intersection of Events

The intersection of two events is given by the outcomes that are common to both events.

Definition

Intersection of Events Let A and B be two events defined in a sample space. The intersection
of A and B represents the collection of all outcomes that are common to both A and B and is
denoted by

A and B


The intersection of events A and B is also denoted by either A ∩ B or AB. Let

A = event that a family owns a DVD player
B = event that a family owns a digital camera

Figure 4.14 illustrates the intersection of events A and B. The shaded area in this figure gives
the intersection of events A and B, and it includes all the families who own both a DVD player
and a digital camera.

Figure 4.14 Intersection of events A and B.


4.8.2 Multiplication Rule

Sometimes we may need to find the probability of two or more events happening together.

Definition

Joint Probability

The probability of the intersection of two events is called their joint probability.

It is written as

P(A and B)

The probability of the intersection of two events is obtained by multiplying the marginal
probability of one event by the conditional probability of the second event. This rule is called
the multiplication rule.

Multiplication Rule to Find Joint Probability The probability of the intersection of two events A
and B is

P(A and B) = P(A) P(B | A)

The joint probability of events A and B can also be denoted by P(A ∩ B) or P(AB).

■ EXAMPLE 4-23

Table 4.7 gives the classification of all employees of a company by gender and college degree.

Calculating the joint probability of two events: two-way table.

Table 4.7 Classification of Employees by Gender and Education

              College Graduate (G)   Not a College Graduate (N)   Total
Male (M)               7                        20                  27
Female (F)             4                         9                  13
Total                 11                        29                  40

If one of these employees is selected at random for membership on the employee-management
committee, what is the probability that this employee is a female and a college graduate?

Solution We are to calculate the probability of the intersection of the events "female" (denoted by
F) and "college graduate" (denoted by G). This probability may be computed using the formula

P(F and G) = P(F) P(G | F)


The shaded area in Figure 4.15 shows the intersection of the events "female" and "college
graduate."

Figure 4.15 Intersection of events F and G.

Notice that there are 13 females among 40 employees. Hence, the probability that a female
is selected is

P(F) = 13/40

To calculate the probability P(G | F), we know that F has already occurred. Consequently, the
employee selected is one of the 13 females. In the table, there are 4 college graduates among
13 female employees. Hence, the conditional probability of G given F is

P(G | F) = 4/13

The joint probability of F and G is

P(F and G) = P(F) P(G | F) = (13/40)(4/13) = .100

Thus, the probability is .100 that a randomly selected employee is a female and a college graduate.

The probability in this example can also be calculated without using the multiplication rule.
As we can notice from Figure 4.15 and from the table, 4 employees out of a total of 40 are
female and college graduates. Hence, if any of these 4 employees is selected, the events "fe-
male" and "college graduate" both happen. Therefore, the required probability is

P(F and G) = 4/40 = .100 ■

We can compute three other joint probabilities for the table in Example 4-23 as follows:

P(M and G) = P(M) P(G | M) = (27/40)(7/27) = .175
P(M and N) = P(M) P(N | M) = (27/40)(20/27) = .500
P(F and N) = P(F) P(N | F) = (13/40)(9/13) = .225
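
The four joint probabilities of Example 4-23 can be reproduced with a short sketch of the multiplication rule. The cell counts below are the ones implied by the worked solution (4 and 9 for females, 7 and 20 for males); the dictionary layout and the helper names `marginal`, `conditional`, and `joint` are assumptions made for this illustration.

```python
from fractions import Fraction

# Cell counts implied by the solution of Example 4-23.
# Keys are (gender, education): M/F and G (graduate) / N (not a graduate).
counts = {
    ("M", "G"): 7, ("M", "N"): 20,
    ("F", "G"): 4, ("F", "N"): 9,
}
total = sum(counts.values())  # 40 employees

def row_total(gender):
    """Number of employees of the given gender."""
    return sum(n for (g, _), n in counts.items() if g == gender)

def marginal(gender):
    """Marginal probability of the gender, e.g. P(F) = 13/40."""
    return Fraction(row_total(gender), total)

def conditional(education, gender):
    """Conditional probability of education given gender, e.g. P(G | F) = 4/13."""
    return Fraction(counts[(gender, education)], row_total(gender))

def joint(gender, education):
    """Multiplication rule: P(gender and education) = P(gender) P(education | gender)."""
    return marginal(gender) * conditional(education, gender)

joint_probs = {cell: joint(*cell) for cell in counts}
for cell, p in joint_probs.items():
    print(cell, p, float(p))
```

Each result equals the cell count divided by 40, which is the direct calculation shown at the end of the example, and the four joint probabilities add to 1.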

The tree diagram in Figure 4.16 shows all four joint probabilities for this example. The joint
probability of F and G is highlighted.


■ EXAMPLE 4-24

A box contains 20 DVDs, 4 of which are defective. If two DVDs are selected at random (with-
out replacement) from this box, what is the probability that both are defective?

Solution Let us define the following events for this experiment:

G₁ = event that the first DVD selected is good
D₁ = event that the first DVD selected is defective
G₂ = event that the second DVD selected is good
D₂ = event that the second DVD selected is defective

We are to calculate the joint probability of D₁ and D₂, which is given by

P(D₁ and D₂) = P(D₁) P(D₂ | D₁)

As we know, there are 4 defective DVDs in 20. Consequently, the probability of selecting
a defective DVD at the first selection is

P(D₁) = 4/20

To calculate the probability P(D₂ | D₁), we know that the first DVD selected is defective
because D₁ has already occurred. Because the selections are made without replacement, there
are 19 total DVDs, and 3 of them are defective at the time of the second selection. Therefore,

P(D₂ | D₁) = 3/19

Hence, the required probability is

P(D₁ and D₂) = P(D₁) P(D₂ | D₁) = (4/20)(3/19) = .0316

The tree diagram in Figure 4.17 shows the selection procedure and the final four outcomes
for this experiment along with their probabilities. The joint probability of D₁ and D₂ is
highlighted in the tree diagram.
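
The result of Example 4-24 can be cross-checked by listing every ordered way of drawing two DVDs without replacement. The sketch below is an illustrative check under that equally-likely-draws assumption; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# The box of Example 4-24: 4 defective ("D") and 16 good ("G") DVDs.
dvds = ["D"] * 4 + ["G"] * 16

# Multiplication rule: P(D1 and D2) = P(D1) P(D2 | D1).
p_rule = Fraction(4, 20) * Fraction(3, 19)

# Enumeration: every ordered pair of distinct DVDs is equally likely.
pairs = list(permutations(range(len(dvds)), 2))
both_defective = sum(1 for i, j in pairs if dvds[i] == "D" and dvds[j] == "D")
p_enum = Fraction(both_defective, len(pairs))

print(p_rule, p_enum, round(float(p_rule), 4))  # 3/95 3/95 0.0316
```

There are 380 ordered pairs, 12 of which consist of two defective DVDs, so the enumeration agrees with the multiplication rule.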

Calculating the joint probability
of two events.

Conditional probability was discussed in Section 4.4. It is obvious from the formula for
joint probability that if we know the probability of an event A and the joint probability of events
A and B, then we can calculate the conditional probability of B given A.


Calculating Conditional Probability If A and B are two events, then,

P(B | A) = P(A and B) / P(A) and P(A | B) = P(A and B) / P(B)

given that P(A) ≠ 0 and P(B) ≠ 0.

■ EXAMPLE 4-25

Calculating the conditional probability of an event.

The probability that a randomly selected student from a college is a senior is .20, and the joint
probability that the student is a computer science major and a senior is .03. Find the conditional
probability that a student selected at random is a computer science major given that the
student is a senior.

Solution Let us define the following two events:

A = the student selected is a senior

B = the student selected is a computer science major

From the given information,

P(A) = .20 and P(A and B) = .03

Hence,

P(B | A) = P(A and B) / P(A) = .03/.20 = .15

Thus, the (conditional) probability is .15 that a student selected at random is a computer
science major given that he or she is a senior. ■
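
The conditional probability formula applied in Example 4-25 can be sketched as follows. The function name `conditional_probability` is an assumption made for this illustration, and the decimal inputs are converted to exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction

def conditional_probability(p_joint, p_given):
    """Return P(B | A) = P(A and B) / P(A); P(A) must not be zero."""
    if p_given == 0:
        raise ValueError("the conditioning event must have nonzero probability")
    return p_joint / p_given

# Example 4-25: P(A) = .20 (senior), P(A and B) = .03 (senior and CS major).
p_senior = Fraction("0.20")
p_senior_and_cs = Fraction("0.03")

p_cs_given_senior = conditional_probability(p_senior_and_cs, p_senior)
print(p_cs_given_senior, float(p_cs_given_senior))  # 3/20 0.15
```

The guard against a zero denominator mirrors the condition P(A) ≠ 0 stated with the formula.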

Multiplication Rule for Independent Events

The foregoing discussion of the multiplication rule was based on the assumption that the two
events are dependent. Now suppose that events A and B are independent. Then,

P(A) = P(A | B) and P(B) = P(B | A)

By substituting P(B) for P(B | A) into the formula for the joint probability of A and B, we
obtain

P(A and B) = P(A) P(B)

Multiplication Rule to Calculate the Probability of Independent Events The probability of the
intersection of two independent events A and B is

P(A and B) = P(A) P(B)

■ EXAMPLE 4-26

Calculating the joint probability of two independent events.

An office building has two fire detectors. The probability is .02 that any fire detector of this
type will fail to go off during a fire. Find the probability that both of these fire detectors will
fail to go off in case of a fire.

Solution In this example, the two fire detectors are independent because whether or not one
fire detector goes off during a fire has no effect on the second fire detector. We define the fol-
lowing two events:

A = the first fire detector fails to go off during a fire
B = the second fire detector fails to go off during a fire


Then, the joint probability of A and B is

P(A and B) = P(A) P(B) = (.02)(.02) = .0004 ■
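
For independent events the joint probability is simply the product of the marginal probabilities, which can be sketched as below. The helper `joint_independent` is a hypothetical name; it assumes the caller has already established that the events are independent.

```python
from fractions import Fraction
from math import prod

def joint_independent(*probabilities):
    """Joint probability of independent events: the product of their probabilities."""
    return prod(probabilities)

# Example 4-26: each fire detector fails with probability .02.
p_fail = Fraction(2, 100)
p_both_fail = joint_independent(p_fail, p_fail)
print(p_both_fail, float(p_both_fail))  # 1/2500 0.0004
```

Because the function accepts any number of arguments, the same call pattern covers three or more independent events.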

The multiplication rule can be extended to calculate the joint probability of more than two events.
Example 4-27 illustrates such a case for independent events.

■ EXAMPLE 4-27

Calculating the joint probability of three events.

The probability that a patient is allergic to penicillin is .20. Suppose this drug is administered
to three patients.

(a) Find the probability that all three of them are allergic to it.

(b) Find the probability that at least one of them is not allergic to it.

Solution

(a) Let A, B, and C denote the events that the first, second, and third patients, respec-
tively, are allergic to penicillin. We are to find the joint probability of A, B, and C.
All three events are independent because whether or not one patient is allergic does
not depend on whether or not any of the other patients is allergic. Hence,

P(A and B and C) = P(A) P(B) P(C) = (.20)(.20)(.20) = .008

The tree diagram in Figure 4.18 shows all the outcomes for this experiment. Events
Ā, B̄, and C̄ are the complementary events of A, B, and C, respectively. They represent
the events that the patients are not allergic to penicillin. Note that the intersection
of events A, B, and C is written as ABC in the tree diagram.

Figure 4.18 Tree diagram for joint probabilities.

Final outcomes:
P(ABC) = .008
P(ABC̄) = .032
P(AB̄C) = .032
P(AB̄C̄) = .128
P(ĀBC) = .032
P(ĀBC̄) = .128
P(ĀB̄C) = .128
P(ĀB̄C̄) = .512

Going "0 for July," as former infielder Bob Aspromonte once put it, is enough to make a baseball
player toss out his lucky bat or start seriously searching for flaws in his hitting technique. But the
culprit is usually just simple mathematics.

Statistician Harry Roberts of the University of Chicago's Graduate School of Business studied
the records of major-league baseball players and found that a batter is no more likely to hit worse
when he is in a slump than when he is in a hot streak. The occurrences of hits followed the same
pattern as purely random events such as pulling marbles out of a hat. If there were one white
marble and three black ones in the hat, for example, then a white marble would come out about
one quarter of the time— a .250 average. In the same way, a player who hits .250 will in the long
run get a hit every four times at bat.

But that doesn't mean the player will hit the ball exactly every fourth time he comes to the
plate— just as it's unlikely that the white marble will come out exactly every fourth time.

Even a batter who goes hitless 10 times in a row might safely be able to pin the blame on
statistical fluctuations. The odds of pulling a black marble out of a hat 10 times in a row are about
6 percent— not a frequent occurrence, but not impossible, either. Only in the long run do these
statistical fluctuations even out.

As mentioned in the above excerpt from U.S. News & World Report, if we assume a player hits .250
in the long run, the probability that this player does not hit during a specific trip to the plate is .75. Hence,
we can calculate the probability that he goes hitless 10 times in a row as follows:

P(hitless 10 times in a row) = (.75)(.75) · · · (.75) (ten times)
= (.75)^10 = .0563

Note that each trip to the plate is independent, and the probability that a player goes hitless 10 times in
a row is given by the intersection of 10 hitless trips. This probability has been rounded off to "about 6%"
in this illustration.
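
The streak calculation above is a repeated application of the multiplication rule for independent events, and a minimal sketch of it follows. The function name `hitless_streak_probability` is an assumption made for this illustration, and it relies on the same assumption as the text that trips to the plate are independent.

```python
def hitless_streak_probability(batting_average, trips):
    """P(no hit on each of `trips` independent trips to the plate)."""
    return (1 - batting_average) ** trips

p_streak = hitless_streak_probability(0.250, 10)
print(round(p_streak, 4))  # 0.0563
```

Changing the second argument shows how quickly the probability of a long hitless streak falls as the streak lengthens.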

(b) Let us define the following events:

G = all three patients are allergic
H = at least one patient is not allergic

Events G and H are two complementary events. Event G consists of the intersection
of events A, B, and C. Hence, from part (a),

P(G) = P(A and B and C) = .008

Therefore, using the complementary event rule, we obtain

P(H) = 1 - P(G) = 1 - .008 = .992 ■
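
Both parts of Example 4-27 can be checked by enumerating the eight branches of the tree diagram. The sketch below assumes, as the example does, that the three patients are independent; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

p_allergic = Fraction(1, 5)  # .20, as given in Example 4-27

# Enumerate the eight final outcomes; True means "allergic", False means "not allergic".
outcomes = {}
for combo in product([True, False], repeat=3):
    prob = Fraction(1)
    for allergic in combo:
        prob *= p_allergic if allergic else 1 - p_allergic
    outcomes[combo] = prob

# Part (a): all three patients are allergic.
p_all_allergic = outcomes[(True, True, True)]

# Part (b): at least one patient is not allergic (complementary event rule).
p_at_least_one_not = 1 - p_all_allergic

print(float(p_all_allergic), float(p_at_least_one_not))  # 0.008 0.992
```

Summing the seven outcomes other than (True, True, True) gives the same answer for part (b) as the complementary event rule, which is why the rule saves work here.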

Case Study 4-2 calculates the probability of a hitless streak in baseball by using the mul-
tiplication rule.

Joint Probability of Mutually Exclusive Events

We know from an earlier discussion that two mutually exclusive events cannot happen together.
Consequently, their joint probability is zero.

Joint Probability of Mutually Exclusive Events The joint probability of two mutually exclusive
events is always zero. If A and B are two mutually exclusive events, then,

P(A and B) = 0

BASEBALL PLAYERS HAVE "SLUMPS" AND "STREAKS"

Source: U.S. News & World Report, July 11, 1988, p. 46. Copyright © 1988, by
U.S. News & World Report, Inc. Excerpts reprinted with permission.


Illustrating the probability of
two mutually exclusive events.

■ EXAMPLE 4-28

Consider the following two events for an application filed by a person to obtain a car loan:

A = event that the loan application is approved
R = event that the loan application is rejected

What is the joint probability of A and R?

Solution The two events A and R are mutually exclusive. Either the loan application will
be approved or it will be rejected. Hence,

P(A and R) = 0 ■

EXERCISES

CONCEPTS AND PROCEDURES

4.66 Explain the meaning of the intersection of two events. Give one example.

4.67 What is meant by the joint probability of two or more events? Give one example.

4.68 How is the multiplication rule of probability for two dependent events different from the rule for two
independent events?

4.69 What is the joint probability of two mutually exclusive events? Give one example.

4.70 Find the joint probability of A and B for the following.

a. P(A) = .40 and P(B | A) = .25

b. P(B) = .65 and P(A | B) = .36

4.71 Find the joint probability of A and B for the following.

a. P(B) = .59 and P(A | B) = .77

b. P(A) = .28 and P(B | A) = .35

4.72 Given that A and B are two independent events, find their joint probability for the following.

a. P(A) = .61 and P(B) = .27

b. P(A) = .39 and P(B) = .63

4.73 Given that A and B are two independent events, find their joint probability for the following.

a. P(A) = .20 and P(B) = .76

b. P(A) = .57 and P(B) = .32

4.74 Given that A, B, and C are three independent events, find their joint probability for the following.

a. P(A) = .20, P(B) = .46, and P(C) = .25

b. P(A) = .44, P(B) = .27, and P(C) = .43

4.75 Given that A, B, and C are three independent events, find their joint probability for the following.

a. P(A) = .49, P(B) = .67, and P(C) = .75

b. P(A) = .71, P(B) = .34, and P(C) = .45

4.76 Given that P(A) = .30 and P(A and B) = .24, find P(B | A).

4.77 Given that P(B) = .65 and P(A and B) = .45, find P(A | B).

4.78 Given that P(A | B) = .40 and P(A and B) = .36, find P(B).

4.79 Given that P(B | A) = .80 and P(A and B) = .58, find P(A).

■ APPLICATIONS

4.80 In a sample survey, 1800 senior citizens were asked whether or not they have ever been victimized
by a dishonest telemarketer. The following table gives the responses by age group (in years).

Have Been
Victimized

Have Never
Been Victimized

60-69 (A)

106

698

Age

70-79 (B)

145

447

80 or over (C)

343


a. Suppose one person is randomly selected from these senior citizens. Find the following probabilities.

i. P(have been victimized and C)

ii. P(have never been victimized and A)

b. Find P(B and C). Is this probability zero? Explain why or why not.

4.81 The following table gives a two-way classification of all basketball players at a state university who
began their college careers between 2001 and 2005, based on gender and whether or not they graduated.

Graduated

Did Not Graduate

Male

126

Female

133

a. If one of these players is selected at random, find the following probabilities.

i. P(female and graduated)

ii. P(male and did not graduate)

b. Find P(graduated and did not graduate). Is this probability zero? If yes, why?

4.82 Five hundred employees were selected from a city's large private companies and asked whether or
not they have any retirement benefits provided by their companies. Based on this information, the follow-
ing two-way classification table was prepared.

          Have Retirement Benefits
            Yes      No
Men         225      75
Women       150      50

a. Suppose one employee is selected at random from these 500 employees. Find the following prob-
abilities.

i. Probability of the intersection of events "woman" and "yes"

ii. Probability of the intersection of events "no" and "man"

b. Mention what other joint probabilities you can calculate for this table and then find them. You
may draw a tree diagram to find these probabilities.

4.83 Two thousand randomly selected adults were asked whether or not they have ever shopped on the
Internet. The following table gives a two-way classification of the responses obtained.

            Have Shopped   Have Never Shopped
Male             500               700
Female           300               500

a. Suppose one adult is selected at random from these 2000 adults. Find the following probabilities.

i. P(has never shopped on the Internet and is a male)

ii. P(has shopped on the Internet and is a female)

b. Mention what other joint probabilities you can calculate for this table and then find those. You
may draw a tree diagram to find these probabilities.

4.84 A consumer agency randomly selected 1700 flights for two major airlines, A and B. The following
table gives the two-way classification of these flights based on airline and arrival time. Note that "less than
30 minutes late" includes flights that arrived early or on time.

Less Than 30

30 Minutes to

More Than

Minutes Late

1 Hour Late

Airline A

429

390

Airline B

393

316

a. Suppose one flight is selected at random from these 1700 flights. Find the following probabilities.

i. P(more than 1 hour late and airline A)

ii. P(airline B and less than 30 minutes late)

b. Find the joint probability of events "30 minutes to 1 hour late" and "more than 1 hour late." Is
this probability zero? Explain why or why not.

4.85 Two thousand randomly selected adults were asked if they think they are financially better off than
their parents. The following table gives the two-way classification of the responses based on the educa-
tion levels of the persons included in the survey and whether they are financially better off, the same as,
or worse off than their parents.

Less Than

High

More Than

High School

School

High School

Better off

140

450

420

Same as

250

110

Worse off

200

300

a. Suppose one adult is selected at random from these 2000 adults. Find the following probabilities.

i. P(better off and high school)

ii. P(more than high school and worse off)

b. Find the joint probability of the events "worse off" and "better off." Is this probability zero? Ex-
plain why or why not.

4.86 In a statistics class of 42 students, 28 have volunteered for community service in the past. If two stu-
dents are selected at random from this class, what is the probability that both of them have volunteered
for community service in the past? Draw a tree diagram for this problem.

4.87 In a political science class of 35 students, 21 favor abolishing the electoral college and thus electing
the President of the United States by popular vote. If two students are selected at random from this class,
what is the probability that both of them favor abolition of the electoral college? Draw a tree diagram for
this problem.

4.88 A company is to hire two new employees. They have prepared a final list of eight candidates, all of
whom are equally qualified. Of these eight candidates, five are women. If the company decides to select
two persons randomly from these eight candidates, what is the probability that both of them are women?
Draw a tree diagram for this problem.

4.89 In a group of 10 persons, 4 have a type A personality and 6 have a type B personality. If two persons
are selected at random from this group, what is the probability that the first of them has a type A per-
sonality and the second has a type B personality? Draw a tree diagram for this problem.

4.90 The probability is .80 that a senior from a large college in New York State has never gone to Florida
for spring break. If two college seniors are selected at random from this college, what is the probability
that the first has never gone to Florida for spring break and the second has? Draw a tree diagram for this
problem.

4.91 The probability that a student graduating from Suburban State University has student loans to pay
off after graduation is .60. If two students are randomly selected from this university, what is the proba-
bility that neither of them has student loans to pay off after graduation?

4.92 A contractor has submitted bids for two state construction projects. The probability of winning each
contract is .25, and it is the same for both contracts.

a. What is the probability that he will win both contracts?

b. What is the probability that he will win neither contract?

Draw a tree diagram for this problem.

4.93 Five percent of all items sold by a mail-order company are returned by customers for a refund. Find
the probability that, of two items sold during a given hour by this company,

a. both will be returned for a refund

b. neither will be returned for a refund

Draw a tree diagram for this problem.

4.94 The probability that any given person is allergic to a certain drug is .03. What is the probability that
none of three randomly selected persons is allergic to this drug? Assume that all three persons are inde-
pendent.

4.95 The probability that a farmer is in debt is .80. What is the probability that three randomly selected
farmers are all in debt? Assume independence of events.

4.96 The probability that a student graduating from Suburban State University has student loans to pay
off after graduation is .60. The probability that a student graduating from this university has student loans
to pay off after graduation and is a male is .24. Find the conditional probability that a randomly selected
student from this university is a male given that this student has student loans to pay off after graduation.

4.97 The probability that an employee at a company is a female is .36. The probability that an employee
is a female and married is .19. Find the conditional probability that a randomly selected employee from
this company is married given that she is a female.

4.98 A telephone poll conducted of 1000 adult Americans for the Washington Post in March 2009 asked
about current events in the United States. Suppose that of the 1000 respondents, 629 stated that they were
cutting back on their daily spending. Suppose that 322 of the 629 people who stated that they were cut-
ting back on their daily spending said that they were cutting back "somewhat" and 97 stated that they were
cutting back "somewhat" and delaying the purchase of a new car by at least 6 months. If one of the 629
people who are cutting back on their spending is selected at random, what is the probability that he/she
is delaying the purchase of a new car by at least 6 months given that he/she is cutting back on spending
"somewhat?"

4.99 Suppose that 20% of all adults in a small town live alone, and 8% of the adults live alone and have
at least one pet. What is the probability that a randomly selected adult from this town has at least one pet
given that this adult lives alone?

4.9 Union of Events and the Addition Rule

This section discusses the union of events and the addition rule that is applied to compute the
probability of the union of events.

4.9.1 Union of Events

The union of two events A and B includes all outcomes that are either in A or in B or in both
A and B.

Definition

Union of Events Let A and B be two events defined in a sample space. The union of events A
and B is the collection of all outcomes that belong either to A or to B or to both A and B and is
denoted by

A or B

The union of events A and B is also denoted by A ∪ B. Example 4-29 illustrates the union
of events A and B.

■ EXAMPLE 4-29

A senior citizens center has 300 members. Of them, 140 are male, 210 take at least one
medicine on a permanent basis, and 95 are male and take at least one medicine on a per-
manent basis. Describe the union of the events "male" and "take at least one medicine on a
permanent basis."

Solution Let us define the following events:

M = a senior citizen is a male
F = a senior citizen is a female
A = a senior citizen takes at least one medicine
B = a senior citizen does not take any medicine

The union of the events "male" and "take at least one medicine" includes those senior citizens
who are either male or take at least one medicine or both. The number of such senior citizens is

140 + 210 - 95 = 255

Illustrating the union
of two events.


Why did we subtract 95 from the sum of 140 and 210? The reason is that 95 senior citizens
(which represent the intersection of events M and A) are common to both events M and A
and, hence, are counted twice. To avoid double counting, we subtracted 95 from the sum of
the other two numbers. We can observe this double counting from Table 4.8, which is con-
structed using the given information. The sum of the numbers in the three shaded cells gives
the number of senior citizens who are either male or take at least one medicine or both. How-
ever, if we add the totals of the row labeled M and the column labeled A, we count 95 twice.

Table 4.8

            A       B     Total
M          95*     45      140
F         115      45      160
Total     210      90      300

* Counted twice

Figure 4.19 shows the diagram for the union of the events "male" and "take at least one med-
icine on a permanent basis."

Figure 4.19 Union of events M and A. The area shaded in red gives the union of events
M and A and includes 255 senior citizens.
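The counting argument in Example 4-29 can be checked with a short sketch using Python sets. The member ID numbers below are hypothetical; they are chosen only so that the group sizes match the example (140 males, 210 who take medicine, 95 in both groups).

```python
# Hypothetical member IDs 0..299, arranged so the counts match Example 4-29.
male = set(range(0, 140))               # 140 males: IDs 0-139
takes_medicine = set(range(45, 255))    # 210 members: IDs 45-254

both = male & takes_medicine            # intersection of M and A
either = male | takes_medicine          # union of M and A

# Adding the two group sizes counts the intersection twice, so subtract it once.
count_by_formula = len(male) + len(takes_medicine) - len(both)

print(len(both), len(either), count_by_formula)
```

The size of the union computed directly by the set operation agrees with the count obtained by adding the two group sizes and subtracting the overlap.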

4.9.2 Addition Rule

The method used to calculate the probability of the union of events is called the addition rule.
It is defined as follows.

Addition Rule to Find the Probability of Union of Events The probability of the union of two
events A and B is

P(A or B) = P(A) + P(B) - P(A and B)

Thus, to calculate the probability of the union of two events A and B, we add their marginal
probabilities and subtract their joint probability from this sum. We must subtract the joint proba-
bility of A and B from the sum of their marginal probabilities to avoid double counting because of
common outcomes in A and B. This is the case where events A and B are not mutually exclusive.
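As a minimal sketch, the addition rule can be written as a small Python function. The function name prob_union is our own choice for illustration, and the inputs are the relative frequencies from Example 4-29 (140, 210, and 95 out of 300 members).

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    # The joint probability can never exceed either marginal probability.
    if not 0 <= p_a_and_b <= min(p_a, p_b):
        raise ValueError("P(A and B) must lie between 0 and min(P(A), P(B))")
    return p_a + p_b - p_a_and_b

# Senior citizens center from Example 4-29: M = male, A = takes medicine.
p_m = Fraction(140, 300)
p_a = Fraction(210, 300)
p_m_and_a = Fraction(95, 300)

p_m_or_a = prob_union(p_m, p_a, p_m_and_a)
print(p_m_or_a, float(p_m_or_a))
```

Exact fractions are used here to avoid rounding; the result equals 255/300, which matches the count of 255 senior citizens found in Example 4-29.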

■ EXAMPLE 4-30

A university president proposed that all students must take a course in ethics as a requirement
for graduation. Three hundred faculty members and students from this university were asked
about their opinions on this issue. Table 4.9 gives a two-way classification of the responses of
these faculty members and students.

Table 4.9 Two-Way Classification of Responses

           Favor   Oppose   Neutral   Total
Faculty      45      15       10        70
Student      90     110       30       230
Total       135     125       40       300

Calculating the probability
of the union of two events:
two-way table.


Find the probability that one person selected at random from these 300 persons is a faculty
member or is in favor of this proposal.

Solution Let us define the following events:

A = the person selected is a faculty member

B = the person selected is in favor of the proposal

From the information given in Table 4.9,

P(A) = 70/300 = .2333
P(B) = 135/300 = .4500

P(A and B) = P(A) P(B | A) = (70/300)(45/70) = .1500

Using the addition rule, we obtain

P(A or B) = P(A) + P(B) - P(A and B) = .2333 + .4500 - .1500 = .5333

Thus, the probability that a randomly selected person from these 300 persons is a faculty mem-
ber or is in favor of this proposal is .5333.

The probability in this example can also be calculated without using the addition rule. The total
number of persons in Table 4.9 who are either faculty members or in favor of this proposal is

45 + 15 + 10 + 90 = 160
Hence, the required probability is

P(A or B) = 160/300 = .5333 ■
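Both methods of Example 4-30 can be compared in a short Python sketch. The dictionary layout and variable names are our own; the counts are those of Table 4.9, with the student neutral count of 30 inferred from the row and column totals.

```python
from fractions import Fraction

# Counts from Table 4.9 (rows: group, columns: opinion).
table = {
    "faculty": {"favor": 45, "oppose": 15, "neutral": 10},
    "student": {"favor": 90, "oppose": 110, "neutral": 30},
}
total = sum(sum(row.values()) for row in table.values())

p_faculty = Fraction(sum(table["faculty"].values()), total)            # P(A)
p_favor = Fraction(sum(row["favor"] for row in table.values()), total) # P(B)
p_both = Fraction(table["faculty"]["favor"], total)                    # P(A and B)

# Method 1: the addition rule.
p_union_rule = p_faculty + p_favor - p_both

# Method 2: count each person who is faculty or in favor exactly once.
count = sum(
    n
    for group, row in table.items()
    for opinion, n in row.items()
    if group == "faculty" or opinion == "favor"
)
p_union_count = Fraction(count, total)

print(p_union_rule, p_union_count, round(float(p_union_rule), 4))
```

The two methods give the same fraction, 160/300, which rounds to .5333.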

■ EXAMPLE 4-31

In a group of 2500 persons, 1400 are female, 600 are vegetarian, and 400 are female and veg-
etarian. What is the probability that a randomly selected person from this group is a male or
vegetarian?

Solution Let us define the following events:

F = the randomly selected person is a female
M = the randomly selected person is a male
V = the randomly selected person is a vegetarian
N = the randomly selected person is a non-vegetarian

From the given information, we know that of the group, 1400 are female, 600 are vegetarian, and
400 are female and vegetarian. Hence, 1100 are male, 1900 are nonvegetarian, and 200 are male
and vegetarian. We are to find the probability P(M or V). This probability is obtained as follows:

P(M or V) = P(M) + P(V) - P(M and V)
          = 1100/2500 + 600/2500 - 200/2500
          = .44 + .24 - .08 = .60

Actually, using the given information, we can prepare Table 4.10 for this example. In the table,
the numbers in the shaded cells are given to us. The remaining numbers are calculated by do-
ing some arithmetic manipulations.

Table 4.10 Two-Way Classification Table

              Vegetarian (V)   Nonvegetarian (N)   Total
Female (F)         400               1000           1400
Male (M)           200                900           1100
Total              600               1900           2500

Calculating the probability of
the union of two events.


Using Table 4.10, we find the required probability:

P(M or V) = P(M) + P(V) - P(M and V)
          = 1100/2500 + 600/2500 - 200/2500 = .44 + .24 - .08 = .60
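The "arithmetic manipulations" used to fill in Table 4.10 can be sketched in Python as follows. The variable names are our own; only the four given numbers (2500, 1400, 600, and 400) come from the example, and every other cell is derived by subtraction.

```python
# Given in Example 4-31.
total, female, vegetarian, female_veg = 2500, 1400, 600, 400

# Remaining cells of the two-way table, obtained by subtraction.
male = total - female                  # row total for males
nonveg = total - vegetarian            # column total for nonvegetarians
male_veg = vegetarian - female_veg     # male and vegetarian
female_nonveg = female - female_veg    # female and nonvegetarian
male_nonveg = male - male_veg          # male and nonvegetarian

# Addition rule applied to counts, then converted to a probability.
p_male_or_veg = (male + vegetarian - male_veg) / total

print(male, nonveg, male_veg, female_nonveg, male_nonveg, p_male_or_veg)
```

The derived cells agree with Table 4.10, and the probability equals .60.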

Addition Rule for Mutually Exclusive Events

We know from an earlier discussion that the joint probability of two mutually exclusive events
is zero. When A and B are mutually exclusive events, the term P(A and B) in the addition rule
becomes zero and is dropped from the formula. Thus, the probability of the union of two mu-
tually exclusive events is given by the sum of their marginal probabilities.

Addition Rule to Find the Probability of the Union of Mutually Exclusive Events The probability
of the union of two mutually exclusive events A and B is

P(A or B) = P(A) + P(B)

Calculating the probability of
the union of two mutually
exclusive events: two-way table.

■ EXAMPLE 4-32

A university president proposed that all students must take a course in ethics as a requirement
for graduation. Three hundred faculty members and students from this university were asked
about their opinion on this issue. The following table, reproduced from Table 4.9 in Example
4-30, gives a two-way classification of the responses of these faculty members and students.

           Favor   Oppose   Neutral   Total
Faculty      45      15       10        70
Student      90     110       30       230
Total       135     125       40       300

What is the probability that a randomly selected person from these 300 faculty members and
students is in favor of the proposal or is neutral?

Solution Let us define the following events:

F = the person selected is in favor of the proposal

N = the person selected is neutral

As shown in Figure 4.20, events F and N are mutually exclusive because a person selected
can be either in favor or neutral but not both.

Figure 4.20 Venn diagram of mutually
exclusive events.

From the given information,

P(F) = 135/300 = .4500
P(N) = 40/300 = .1333

Hence,

P(F or N) = P(F) + P(N) = .4500 + .1333 = .5833 ■
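For mutually exclusive events the rule reduces to a plain sum, which the following Python sketch illustrates with the column totals of Table 4.9. The helper name prob_union_exclusive is hypothetical; it is valid here only because each respondent gives exactly one opinion, so no two opinion events can occur together.

```python
from fractions import Fraction

# Column totals from Table 4.9; every respondent gives exactly one opinion.
opinion_totals = {"favor": 135, "oppose": 125, "neutral": 40}
n = sum(opinion_totals.values())

def prob_union_exclusive(*events):
    """P(E1 or E2 or ...) for mutually exclusive events: sum of marginals."""
    return sum(Fraction(opinion_totals[e], n) for e in events)

p_favor_or_neutral = prob_union_exclusive("favor", "neutral")
print(p_favor_or_neutral, round(float(p_favor_or_neutral), 4))
```

The result, 175/300, rounds to .5833, in agreement with Example 4-32.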


The addition rule formula can easily be extended to apply to more than two events. The fol-
lowing example illustrates this.

EXAMPLE 4-33

Consider the experiment of rolling a die twice. Find the probability that the sum of the num-
bers obtained on two rolls is 5, 7, or 10.

Solution The experiment of rolling a die twice has a total of 36 outcomes, which are
listed in Table 4.11. Assuming that the die is balanced, these 36 outcomes are equally
likely.

Calculating the probability of
the union of three mutually ex-
clusive events.

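Table 4.11 Two Rolls of a Die

                          Second Roll of the Die
                   1       2       3       4       5       6
First        1   (1,1)   (1,2)   (1,3)   (1,4)*  (1,5)   (1,6)*
Roll of      2   (2,1)   (2,2)   (2,3)*  (2,4)   (2,5)*  (2,6)
the Die      3   (3,1)   (3,2)*  (3,3)   (3,4)*  (3,5)   (3,6)
             4   (4,1)*  (4,2)   (4,3)*  (4,4)   (4,5)   (4,6)*
             5   (5,1)   (5,2)*  (5,3)   (5,4)   (5,5)*  (5,6)
             6   (6,1)*  (6,2)   (6,3)   (6,4)*  (6,5)   (6,6)

* Shaded outcome (the sum is 5, 7, or 10)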

The events that give the sum of two numbers equal to 5 or 7 or 10 are shaded in the table.
As we can observe, the three events "the sum is 5," "the sum is 7," and "the sum is 10" are
mutually exclusive. Four outcomes give a sum of 5, six give a sum of 7, and three outcomes
give a sum of 10. Thus,

P(sum is 5 or 7 or 10) = P(sum is 5) + P(sum is 7) + P(sum is 10)

= 4/36 + 6/36 + 3/36 = 13/36 = .3611 ■
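The counts in Example 4-33 can be verified by listing all 36 outcomes, as in this Python sketch. It assumes a balanced die, as the example does, so every ordered pair is equally likely; the helper name p_sum is our own.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first roll, second roll) for a six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the two rolls add up to `target`."""
    favorable = [o for o in outcomes if sum(o) == target]
    return Fraction(len(favorable), len(outcomes))

# The three events are mutually exclusive, so their probabilities add.
p_total = p_sum(5) + p_sum(7) + p_sum(10)

# Direct count of outcomes whose sum is 5, 7, or 10, for comparison.
p_direct = Fraction(sum(1 for o in outcomes if sum(o) in {5, 7, 10}), len(outcomes))

print(p_sum(5), p_sum(7), p_sum(10), p_total, round(float(p_total), 4))
```

Adding the three separate probabilities and counting the 13 favorable outcomes directly give the same value, 13/36.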

■ EXAMPLE 4-34

The probability that a person is in favor of genetic engineering is .55 and that a person is
against it is .45. Two persons are randomly selected, and it is observed whether they favor or
oppose genetic engineering.

Calculating the probability
of the union of three mutually
exclusive events.

(a) Draw a tree diagram for this experiment.

(b) Find the probability that at least one of the two persons favors genetic engineering.

Solution

(a) Let

F = a person is in favor of genetic engineering
A = a person is against genetic engineering

This experiment has four outcomes: both persons are in favor (FF), the first person is
in favor and the second is against (FA), the first person is against and the second is in
favor (AF), and both persons are against genetic engineering (AA). The tree diagram
in Figure 4.21 shows these four outcomes and their probabilities.


First person

Second person

Final outcomes and
their probabilities

P(FF) = (.55) (.55) = .3025

P(FA) = (.55) (.45) = .2475

P(AF) = (.45) (.55) = .2475

P(AA) = (.45) (.45) = .2025

Figure 4.21 Tree diagram.

(b) The probability that at least one person favors genetic engineering is given by the
union of events FF, FA, and AF. These three outcomes are mutually exclusive. Hence,

P(at least one person favors) = P(FF or FA or AF)

= P(FF) + P(FA) + P(AF)

= .3025 + .2475 + .2475 = .7975 ■
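The tree diagram of Example 4-34 can be reproduced with a short Python sketch. As in Figure 4.21, the two selections are treated as independent, so each path probability is the product of its branch probabilities. The second calculation, using the complement of the outcome AA, is an additional check that is not part of the example.

```python
from itertools import product

# Branch probabilities for one person: F = in favor, A = against.
p = {"F": 0.55, "A": 0.45}

# Four final outcomes of the tree and their probabilities.
paths = {first + second: p[first] * p[second]
         for first, second in product("FA", repeat=2)}

# Union of the mutually exclusive outcomes FF, FA, and AF.
p_at_least_one = sum(prob for path, prob in paths.items() if "F" in path)

# Check: "at least one favors" is the complement of "both are against".
p_by_complement = 1 - paths["AA"]

for path, prob in paths.items():
    print(path, round(prob, 4))
print(round(p_at_least_one, 4), round(p_by_complement, 4))
```

Both calculations round to .7975, the value obtained in part (b).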

EXERCISES

CONCEPTS AND PROCEDURES

4.100 Explain the meaning of the union of two events. Give one example.

4.101 How is the addition rule of probability for two mutually exclusive events different from the rule for
two mutually nonexclusive events?

4.102 Consider the following addition rule to find the probability of the union of two events A and B:

P(A or B) = P(A) + P(B) - P(A and B)

When and why is the term P(A and B) subtracted from the sum of P(A) and P(B)? Give one example where
you might use this formula.

4.103 When is the following addition rule used to find the probability of the union of two events A and B?

P(A or B) = P(A) + P(B)
Give one example where you might use this formula.

4.104 Find P(A or B) for the following.

a. P(A) = .58, P(B) = .66, and P(A and B) = .57

b. P(A) = .72, P(B) = .42, and P(A and B) = .39

4.105 Find P(A or B) for the following.

a. P(A) = .18, P(B) = .49, and P(A and B) = .11

b. P(A) = .73, P(B) = .71, and P(A and B) = .68

4.106 Given that A and B are two mutually exclusive events, find P(A or B) for the following.

a. P(A) = Al and P(B) = .32

b. P(A) = .16 and P(B) = .59

4.107 Given that A and B are two mutually exclusive events, find P(A or B) for the following,
a. P(A) = .25 and P(B) = .27

b. P(A) = .58 and P(B) = .09


■ APPLICATIONS

4.108 In a sample survey, 1800 senior citizens were asked whether or not they have ever been victimized
by a dishonest telemarketer. The following table gives the responses by age group.

Have Been
Victimized

Have Never
Been Victimized

60-69 (A)

106

698

Age

70-79 (B)

145

447

80 or over (C)

343

Suppose one person is randomly selected from these senior citizens. Find the following probabilities.

a. P(have been victimized or B)

b. P(have never been victimized or C)

4.109 The following table gives a two-way classification of all basketball players at a state university
who began their college careers between 2001 and 2005, based on gender and whether or not they
graduated.

Graduated

Did Not Graduate

Male

126

Female

133

If one of these players is selected at random, find the following probabilities.

a. P(female or did not graduate)

b. P(graduated or male)

4.110 Five hundred employees were selected from a city's large private companies, and they were asked
whether or not they have any retirement benefits provided by their companies. Based on this information,
the following two-way classification table was prepared.

                Have Retirement Benefits
                    Yes      No
Men                 225      75
Women               150      50

Suppose one employee is selected at random from these 500 employees. Find the following probabilities.

a. The probability of the union of events "woman" and "yes"

b. The probability of the union of events "no" and "man"

4.111 Two thousand randomly selected adults were asked whether or not they have ever shopped on the
Internet. The following table gives a two-way classification of the responses.

            Have Shopped    Have Never Shopped
Male            500                700
Female          300                500

Suppose one adult is selected at random from these 2000 adults. Find the following probabilities.

a. P(has never shopped on the Internet or is a female)

b. P(is a male or has shopped on the Internet)

c. P(has shopped on the Internet or has never shopped on the Internet)

4.112 A consumer agency randomly selected 1700 flights for two major airlines, A and B. The follow-
ing table gives the two-way classification of these flights based on airline and arrival time. Note that "less
than 30 minutes late" includes flights that arrived early or on time.


Less Than 30

30 Minutes to

More Than

Minutes Late

1 Hour Late

Airline A

429

390

Airline B

393

316

If one flight is selected at random from these 1700 flights, find the following probabilities.

a. P(more than 1 hour late or airline A)

b. P(airline B or less than 30 minutes late)

c. P(airline A or airline B)

4.113 Two thousand randomly selected adults were asked if they think they are financially better off than
their parents. The following table gives the two-way classification of the responses based on the educa-
tion levels of the persons included in the survey and whether they are financially better off, the same as,
or worse off than their parents.

Less Than

High

More Than

High School

School

High School

Better off

140

450

420

Same as

250

110

Worse off

200

300

Suppose one adult is selected at random from these 2000 adults. Find the following probabilities.

a. P(better off or high school)

b. P(more than high school or worse off)

c. P(better off or worse off)

4.114 There is an area of free (but illegal) parking near an inner-city sports arena. The probability that a
car parked in this area will be ticketed by police is .35, that the car will be vandalized is .15, and that it
will be ticketed and vandalized is .10. Find the probability that a car parked in this area will be ticketed
or vandalized.

4.115 The probability that a family owns a washing machine is .68, that it owns a DVD player is .81, and
that it owns both a washing machine and a DVD player is .58. What is the probability that a randomly se-
lected family owns a washing machine or a DVD player?

4.116 Jason and Lisa are planning an outdoor reception following their wedding. They estimate that the
probability of bad weather is .25, that of a disruptive incident (a fight breaks out, the limousine is late, etc.)
is .15, and that bad weather and a disruptive incident will occur is .08. Assuming these estimates are cor-
rect, find the probability that their reception will suffer bad weather or a disruptive incident.

4.117 The probability that a randomly selected elementary or secondary school teacher from a city is a
female is .68, holds a second job is .38, and is a female and holds a second job is .29. Find the probabil-
ity that an elementary or secondary school teacher selected at random from this city is a female or holds a
second job.

4.118 According to the U.S. Census Bureau's most recent data on the marital status of the 238 million
Americans aged 15 years and older, 123.7 million are currently married and 71.5 million have never
been married. If one person from these 238 million persons is selected at random, find the probability
that this person is currently married or has never been married. Explain why this probability is not equal
to 1.0.

4.119 According to a survey of 2000 home owners, 800 of them own homes with three bedrooms, and
600 of them own homes with four bedrooms. If one home owner is selected at random from these 2000
home owners, find the probability that this home owner owns a house that has three or four bedrooms.
Explain why this probability is not equal to 1.0.

4.120 The probability of a student getting an A grade in an economics class is .24 and that of getting a
B grade is .28. What is the probability that a randomly selected student from this class will get an A or a
B in this class? Explain why this probability is not equal to 1.0.

4.121 Twenty percent of a town's voters favor letting a major discount store move into their neighbor-
hood, 63% are against it, and 17% are indifferent. What is the probability that a randomly selected voter
from this town will either be against it or be indifferent? Explain why this probability is not equal to 1.0.


4.122 The probability that a corporation makes charitable contributions is .72. Two corporations are
selected at random, and it is noted whether or not they make charitable contributions.

a. Draw a tree diagram for this experiment.

b. Find the probability that at most one corporation makes charitable contributions.

4.123 The probability that an open-heart operation is successful is .84. What is the probability that in two ran-
domly selected open-heart operations at least one will be successful? Draw a tree diagram for this experiment.

USES AND MISUSES...

1. STATISTICS VERSUS PROBABILITY

At this point, you may think that probability and statistics are basically
the same things. They both use the term mean, they both report results
in terms of percentages, and so on. Do not be fooled: Although they
share many of the same mathematical tools, probability and statistics
are very different sciences. The first three chapters of the text were very
careful to specify whether a particular set of data was a population or
a sample. This is because statistics takes a sample of data and, based
upon the properties of that sample— mean, median, mode, standard
deviation— attempts to say something about a population. Probability
does exactly the opposite: In probability, we know the properties of the
population based on the sample space and the probability distribution,
and we want to make statements about a sample from the population.

Here's an example viewed from a statistical and a probabilistic
point of view. A sequence of outcomes from 10 independent coin
tosses is {H, T, H, T, H, T, T, H, T, T}. A statistician will ask the ques-
tion: Based on the observed 4 heads and 6 tails, what combination
of heads and tails would he or she expect from 100 or 1000 tosses,
and how certain would he or she be of that answer? Someone us-
ing probability will ask: If the coin toss was fair (the probability of
the event that a single coin toss be a head or tail is .5), what is the
probability that the compound event of four heads and six tails will
occur? These are substantially different questions.

The distinction between a statistical approach and a probabilis-
tic approach to a problem can be surprising. Imagine that you must
determine the average life of an automotive part. One approach would
be to take a sample of parts, test each of them until they fail to work,
and then perform some calculations regarding the distribution of fail-
ures. However, if this particular part has outliers with long life spans
(several years), you are going to be spending a lot of time in the lab-
oratory. An approach using probabilistic techniques could develop a
hypothetical life span based on the physical properties of the part, the
conditions of its use, and the manufacturing characteristics. Then you
can use your experimental results over a relatively short period of
time— including data on those parts that did not fail— to adjust your
prior understanding of what makes the part fail, saving yourself a lot
of time.

2. ODDS AND PROBABILITY

One of the first things we learn in probability is that the sum of the
probabilities of all outcomes for an experiment must equal 1.0. We
also learn about the probabilities that are developed from relative
frequencies and about subjective probabilities. In the latter case, many
of the probabilities involve personal opinions— one hopes, those of
experts in the field. Still, both scenarios (probabilities obtained from
relative frequencies and subjective probabilities) require that all

Team

Odds

Team

Odds

Arizona Diamondbacks